Support Vector Machines

Some slides adapted from:
• Aliferis & Tsamardinos, Vanderbilt University: http://discover1.mc.vanderbilt.edu/discover/public/ml_tutorial_old/index.html
• Rong Jin, Language Technology Institute: www.contrib.andrew.cmu.edu/~jin/ir_proj/svm.ppt
ABDBM © Ron Shamir

Support Vector Machines
• Decision surface: a hyperplane in feature space
• One of the most important tools in the machine learning toolbox
• In a nutshell:
  – Map the data to a predetermined very high-dimensional space via a kernel function
  – Find the hyperplane that maximizes the margin between the two classes
  – If the data are not separable, find the hyperplane that maximizes the margin and minimizes the (weighted average of the) misclassifications

Three Main Ideas
1. Define what an optimal hyperplane is (taking into account that it needs to be computed efficiently): maximize the margin
2. Generalize to non-linearly separable problems: have a penalty term for misclassifications
3. Map the data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that the data are mapped implicitly to this space

Which Separating Hyperplane to Use?
[Figure: several hyperplanes in the (Var1, Var2) plane, each separating the two classes]

Maximizing the Margin
• IDEA 1: Select the separating hyperplane that maximizes the margin!
[Figure: two candidate hyperplanes in the (Var1, Var2) plane with their margin widths]

Support Vectors
[Figure: the maximum-margin hyperplane; the points lying on the margin boundaries are the support vectors]

Setting Up the Optimization Problem
• Take the separating hyperplane $w \cdot x + b = 0$ with margin hyperplanes $w \cdot x + b = k$ and $w \cdot x + b = -k$. The width of the margin is $\frac{2k}{\|w\|}$.
• So the problem is:
  $\max \frac{2k}{\|w\|}$
  s.t. $w \cdot x_i + b \ge k$ for $x_i$ of class 1
       $w \cdot x_i + b \le -k$ for $x_i$ of class 2

Setting Up the Optimization Problem (cont.)
• Scaling $w, b$ so that $k = 1$, the problem becomes:
  $\max \frac{2}{\|w\|}$
  s.t. $w \cdot x_i + b \ge 1$ for $x_i$ of class 1
       $w \cdot x_i + b \le -1$ for $x_i$ of class 2

Setting Up the Optimization Problem (cont.)
• If class 1 corresponds to $y_i = 1$ and class 2 corresponds to $y_i = -1$, we can rewrite
  $w \cdot x_i + b \ge 1$ for $x_i$ with $y_i = 1$
  $w \cdot x_i + b \le -1$ for $x_i$ with $y_i = -1$
  as $y_i (w \cdot x_i + b) \ge 1$ for all $x_i$
• So the problem becomes:
  $\max \frac{2}{\|w\|}$, or equivalently $\min \frac{1}{2}\|w\|^2$,
  s.t. $y_i (w \cdot x_i + b) \ge 1$ for all $x_i$

Linear, Hard-Margin SVM Formulation
• Find $w, b$ that solve
  $\min \frac{1}{2}\|w\|^2$ s.t. $y_i (w \cdot x_i + b) \ge 1$ for all $x_i$
• Quadratic program: quadratic objective, linear (in)equality constraints
• The problem is convex ⇒ there is a unique global minimum value (when feasible)
• There is also a unique minimizer, i.e. the $w$ and $b$ values that provide the minimum
• No solution if the data are not linearly separable
• The objective is positive definite ⇒ polynomial-time solution
• Very efficient solution with modern optimization software (handles thousands of constraints and training instances)

Lagrange Multipliers
• Minimize over $w, b, \Lambda$:
  $\Phi(w) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right]$
  s.t. $\alpha_1 \ge 0, \ldots, \alpha_l \ge 0$
• Convex quadratic programming problem
• Duality theory applies!
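As a concrete illustration of the hard-margin formulation above, here is a minimal sketch that solves the primal QP with a generic solver and reads off the Lagrange multipliers. It assumes the cvxpy package and a tiny hand-made, linearly separable toy dataset; neither appears in the original slides.

```python
# A minimal sketch of the linear, hard-margin SVM as a quadratic program.
# Assumptions not in the slides: cvxpy as the QP solver, toy 2-D data.
import numpy as np
import cvxpy as cp

# Toy data: two classes with labels y_i in {+1, -1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

n, d = X.shape
w = cp.Variable(d)
b = cp.Variable()

# min (1/2)||w||^2  s.t.  y_i (w . x_i + b) >= 1 for all i
margin_constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), margin_constraints)
problem.solve()

print("w =", w.value, " b =", b.value)
print("margin width =", 2 / np.linalg.norm(w.value))

# The duals of the margin constraints are the Lagrange multipliers alpha_i;
# only the support vectors (points on the margin hyperplanes) get alpha_i > 0.
alphas = margin_constraints[0].dual_value
print("alpha =", np.round(alphas, 3))
```

In this tiny example only the two points closest to the boundary receive nonzero multipliers, previewing the dual view developed next.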
Dual Space
• Dual problem:
  Maximize $F(\Lambda) = \Lambda^T \mathbf{1} - \frac{1}{2} \Lambda^T D \Lambda$
  subject to $\Lambda \cdot y = 0$, $\Lambda \ge 0$
  where $\Lambda = (\alpha_1, \alpha_2, \ldots, \alpha_l)$, $y = (y_1, y_2, \ldots, y_l)$, $D_{ij} = y_i y_j \, x_i \cdot x_j$
• Representation of $w$:
  $w = \sum_{i=1}^{l} \alpha_i y_i x_i$
• Decision function:
  $f(x) = \mathrm{sign}\left( \sum_{i=1}^{l} y_i \alpha_i (x \cdot x_i) + b \right)$

Comments
• Representation of the vector $w$:
  – A linear combination of the examples $x_i$ (# parameters = # examples)
  – $\alpha_i$: the importance of each example; only the points closest to the boundary have $\alpha_i > 0$
• Core of the algorithm: $x \cdot x'$
  – Both the matrix $D$ and the decision function need the data only through inner products:
    $D_{ij} = y_i y_j \, x_i \cdot x_j$, $w = \sum_{i=1}^{l} \alpha_i y_i x_i$
    (More on this soon)

Idea 2: Generalize to non-linearly separable problems by adding a penalty term for misclassifications.

Non-Linearly Separable Data
• Introduce slack variables $\xi_i$
• Allow some instances to fall within the margin, but penalize them
[Figure: margin hyperplanes $w \cdot x + b = \pm 1$ around $w \cdot x + b = 0$, with slacks $\xi_i, \xi_j$ for instances inside the margin or on the wrong side]

Formulating the Optimization Problem
• The constraint becomes:
  $y_i (w \cdot x_i + b) \ge 1 - \xi_i$ for all $x_i$, $\xi_i \ge 0$
• The objective function penalizes misclassified instances and those within the margin:
  $\min \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$
• $C$ trades off margin width and misclassifications

Linear, Soft-Margin SVMs
$\min \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ s.t. $y_i (w \cdot x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$ for all $x_i$
• The algorithm tries to keep the $\xi_i$ at zero while maximizing the margin
• It does not minimize the number of misclassifications (an NP-complete problem) but the sum of distances from the margin hyperplanes
• Other formulations use $\xi_i^2$ instead
• $C$: penalty for misclassification
• As $C \to \infty$, we get closer to the hard-margin solution

Dual Space
• Dual problem:
  Maximize $F(\Lambda) = \Lambda^T \mathbf{1} - \frac{1}{2} \Lambda^T D \Lambda$
  subject to $\Lambda \cdot y = 0$, $0 \le \Lambda \le C \mathbf{1}$
  where $\Lambda = (\alpha_1, \alpha_2, \ldots, \alpha_l)$, $y = (y_1, y_2, \ldots, y_l)$, $D_{ij} = y_i y_j \, x_i \cdot x_j$
• The only difference: an upper bound $C$ on the $\alpha_i$
• Representation of $w$: $w = \sum_{i=1}^{l} \alpha_i y_i x_i$
• Decision function: $f(x) = \mathrm{sign}\left( \sum_{i=1}^{l} y_i \alpha_i (x \cdot x_i) + b \right)$

Comments
• The parameter $C$:
  – Controls the range of the $\alpha_i$, avoiding over-emphasizing some examples
  – $\xi_i (C - \alpha_i) = 0$ ("complementary slackness")
  – $C$ can be extended to be case-dependent
• The weight $\alpha_i$:
  – $\alpha_i < C$ ⇒ $\xi_i = 0$ ⇒ the $i$-th example is correctly classified (not that important)
  – $\alpha_i = C$ ⇒ $\xi_i$ can be nonzero ⇒ the $i$-th training example may be misclassified (very important)

Robustness of Soft vs Hard Margin SVMs
[Figure: two panels on the same data with one outlier; the soft-margin SVM absorbs the outlier with a slack $\xi_i$, while the hard-margin SVM must separate it exactly]

Soft vs Hard Margin SVM
• Soft-margin always has a solution
• Soft-margin is more robust to outliers
  – Smoother surfaces (in the non-linear case)
• Hard-margin does not require guessing the cost parameter (it requires no parameters at all)
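The effect of the box constraint $0 \le \alpha_i \le C$ can be seen directly in a library implementation. The following is a brief sketch, assuming scikit-learn and a synthetic two-Gaussian dataset (neither is part of the slides): it fits a linear soft-margin SVM for several values of $C$ and reports the number of support vectors and the largest dual coefficient, which never exceeds $C$.

```python
# Sketch: how the soft-margin parameter C bounds the dual weights alpha_i
# and controls how many support vectors are selected (scikit-learn assumed).
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.hstack([np.ones(20), -np.ones(20)])

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # dual_coef_ holds alpha_i * y_i for the support vectors, so |alpha_i| <= C
    alphas = np.abs(clf.dual_coef_).ravel()
    print(f"C={C:>6}: {len(clf.support_)} support vectors, "
          f"max alpha={alphas.max():.3f} (bounded by C)")
```

A small $C$ pushes many $\alpha_i$ to the bound (wide margin, more tolerated slack), while a large $C$ approaches the hard-margin solution, matching the comments above.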
Idea 3: Map the data to a high-dimensional space where it is easier to classify with linear decision surfaces, reformulating the problem so that the data are mapped implicitly to this space.

Disadvantages of Linear Decision Surfaces
[Figure: two classes in the (Var1, Var2) plane that no straight line separates well]

Advantages of Non-Linear Surfaces
[Figure: the same data separated by a curved decision surface]

Linear Classifiers in High-Dimensional Spaces
• Find a function $\Phi(x)$ to map to a different space
[Figure: a non-linear boundary in the original (Var1, Var2) space becomes a linear boundary in the space of constructed features]

Mapping Data to a High-Dimensional Space
• Find a function $\Phi(x)$ to map to a different space; the SVM formulation becomes:
  $\min \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$
  s.t. $y_i (w \cdot \Phi(x_i) + b) \ge 1 - \xi_i$ for all $x_i$, $\xi_i \ge 0$
• The data appear as $\Phi(x)$; the weights $w$ are now weights in the new space
• The explicit mapping is expensive if $\Phi(x)$ is very high dimensional
• Can we solve the problem without explicitly mapping the data?

The Dual of the SVM Formulation
• Original SVM formulation:
  $\min_{w,b} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ s.t. $y_i (w \cdot \Phi(x_i) + b) \ge 1 - \xi_i$, $\xi_i \ge 0$ for all $x_i$
  – n inequality constraints
  – n positivity constraints
  – n variables
• The (Wolfe) dual of this problem:
  $\min_{\alpha} \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \left( \Phi(x_i) \cdot \Phi(x_j) \right) - \sum_i \alpha_i$
  s.t. $C \ge \alpha_i \ge 0$ for all $x_i$, $\sum_i \alpha_i y_i = 0$
  – one equality constraint
  – n positivity constraints
  – n variables (the Lagrange multipliers)
  – The objective function is more complicated
• But: the data only appear as $\Phi(x_i) \cdot \Phi(x_j)$

The Kernel Trick
• $\Phi(x_i)^T \Phi(x_j)$ means: map the data into the new space, then take the inner product of the new vectors
• Suppose we can find a function such that $K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$, i.e., $K$ is the inner product of the images of the data
• For training, there is no need to explicitly map the data into the high-dimensional space to solve the optimization problem
• How do we classify without explicitly mapping the new instances? It turns out that
  $\mathrm{sgn}(w \cdot \Phi(x) + b) = \mathrm{sgn}\left( \sum_i \alpha_i y_i K(x_i, x) + b \right)$
  where $b$ solves $\alpha_j \left[ y_j \left( \sum_i \alpha_i y_i K(x_i, x_j) + b \right) - 1 \right] = 0$ for any $j$ with $\alpha_j \ne 0$

Examples of Kernels
• Assume we measure $x_1, x_2$ and we use the mapping
  $\Phi: (x_1, x_2) \mapsto \left( x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2,\; \sqrt{2}\, x_1,\; \sqrt{2}\, x_2,\; 1 \right)$
• Consider the function $K(x, z) = (x \cdot z + 1)^2$
• Then
  $\Phi(x)^T \Phi(z) = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 + 2 x_1 z_1 + 2 x_2 z_2 + 1 = (x_1 z_1 + x_2 z_2 + 1)^2 = (x \cdot z + 1)^2 = K(x, z)$

Polynomial and Gaussian Kernels
• $K(x, z) = (x \cdot z + 1)^p$ is called the polynomial kernel of degree $p$.
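The identity above is easy to check numerically. Below is a small sketch in plain NumPy (the helper names phi and poly_kernel are illustrative choices, not from the slides) comparing the explicit 6-dimensional feature map with the degree-2 polynomial kernel evaluated directly on the 2-dimensional inputs.

```python
# Numeric check that K(x, z) = (x.z + 1)^2 equals the inner product of the
# explicit feature map phi(x) = (x1^2, x2^2, sqrt(2) x1 x2, sqrt(2) x1, sqrt(2) x2, 1).
import numpy as np

def phi(v):
    x1, x2 = v
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     1.0])

def poly_kernel(x, z):
    return (np.dot(x, z) + 1.0) ** 2

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.5])

print(phi(x) @ phi(z))      # explicit map, inner product in 6 dimensions
print(poly_kernel(x, z))    # kernel evaluated directly in the 2-D input space
```

Both prints agree, which is exactly the point of the kernel trick: the inner product in the feature space is computed without ever constructing $\Phi(x)$.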
