Support Vector Machines

Andrew Nobel
April, 2020


Support Vector Machines (SVM)

- Simplest case: linear classification rule
- Generalizes to non-linear rules through feature maps and kernels
- Good off-the-shelf method for high dimensional data, widely used
- Begins with geometry rather than a statistical model
- Close connections with convex programming
- Early bridge between machine learning and optimization


Linear Classification Rules

Setting: Labeled pairs $(x, y)$ with predictor $x \in \mathbb{R}^p$ and class $y \in \{+1, -1\}$

Recall: Given $w \in \mathbb{R}^p$ and $b \in \mathbb{R}$, a linear classification rule has the form
$$
\phi(x) = \mathrm{sign}(x^t w - b) =
\begin{cases}
+1 & \text{if } x^t w \ge b \\
-1 & \text{if } x^t w < b
\end{cases}
$$
The decision boundary of $\phi$ is the hyperplane $H = \{x : x^t w = b\}$.

Idea: Given a pair $(x, y)$, ask two questions

- Correctness: Is $\phi(x) = y$?
- Confidence: How far is $x$ from the decision boundary $H$?


Distance to Decision Boundary

Fact: For every $x$ the ratio $(x^t w - b)/\|w\|$ is the signed distance from $x$ to $H$.


Margin

Definition: The margin of the rule $\phi$ at a pair $(x, y)$ is given by
$$
m_\phi(x, y) \;=\; y \, \frac{x^t w - b}{\|w\|}
$$

Idea: The margin measures the overall fit of $\phi$ at $(x, y)$

- $m_\phi(x, y) > 0$ iff $\phi(x) = y$ iff $x$ is on the correct side of $H$
- $m_\phi(x, y) < 0$ iff $\phi(x) \ne y$ iff $x$ is on the wrong side of $H$
- $|m_\phi(x, y)|$ = distance from $x$ to $H$ (from the Fact above)


Max Margin Classifiers and Linear Separability

Goal: In fitting a linear rule to data, we would like the margins of all the data points to be large and positive, if possible. Two cases:

1. Data is linearly separable ⇒ max margin classifier
2. Data is not linearly separable ⇒ soft margin classifier

Definition: $D_n = (x_1, y_1), \ldots, (x_n, y_n)$ is linearly separable if there is a hyperplane $H$ separating $\{x_i : y_i = 1\}$ and $\{x_i : y_i = -1\}$


Linearly Separable Data: Multiple Hyperplanes

[Figure: linearly separable two-class data plotted against $X_1$ and $X_2$, with several candidate separating hyperplanes]


Max Margin Classifier (from ISL)

[Figure: the same data with the maximum margin hyperplane and its margins; axes $X_1$ and $X_2$]


Maximizing the Minimum Margin

Max Margin Classifier: If $D_n$ is linearly separable, find $w$ and $b$ to maximize the minimum margin of $\phi(x) = \mathrm{sign}(x^t w - b)$:
$$
\max_{w, b} \; \Gamma(w, b)
\qquad \text{where} \qquad
\Gamma(w, b) \;=\; \min_{1 \le i \le n} \; y_i \, \frac{x_i^t w - b}{\|w\|}
$$

Fact: The solution of $\max_{w, b} \Gamma(w, b)$ is the solution of the convex program
$$
p^* \;=\; \min_{w, b} \; \tfrac{1}{2} \|w\|^2
\quad \text{subject to} \quad
y_i (x_i^t w - b) \ge 1 \ \text{ for } i = 1, \ldots, n
$$
Finding $p^*$ is called the primal problem.
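As a quick illustration of the quantities above, here is a minimal numpy sketch; the small data set and the candidate pair $(w, b)$ are made-up assumptions, not taken from the slides. It computes $h(x_i) = x_i^t w - b$, the signed distances to $H$, the margins $m_\phi(x_i, y_i)$, and the minimum margin $\Gamma(w, b)$.

```python
import numpy as np

# Made-up example data: rows of X are predictors x_i in R^2, labels y_i in {+1, -1}
X = np.array([[2.0, 2.5],
              [3.0, 3.5],
              [0.5, 1.0],
              [1.0, 0.0]])
y = np.array([+1, +1, -1, -1])

# A candidate linear rule phi(x) = sign(x^t w - b), chosen by hand for illustration
w = np.array([1.0, 1.0])
b = 3.5

h = X @ w - b                        # h(x_i) = x_i^t w - b
signed_dist = h / np.linalg.norm(w)  # signed distance from x_i to the hyperplane H
margins = y * signed_dist            # m_phi(x_i, y_i): positive iff x_i correctly classified
Gamma = margins.min()                # Gamma(w, b), the minimum margin over the data

print("margins:", margins)
print("minimum margin Gamma(w, b):", Gamma)
print("this (w, b) separates the data:", bool(Gamma > 0))
```

A positive minimum margin certifies that this particular $(w, b)$ separates the data; the max margin classifier above searches over $(w, b)$ to make that minimum as large as possible.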
Solving the Problem of Maximizing the Minimum Margin

Approach: Solve primal problem using techniques of convex optimization: the Lagrangian function and duality

Step 1: Define the Lagrangian $L : \mathbb{R}^p \times \mathbb{R} \times \mathbb{R}_+^n \to \mathbb{R}$, with $\mathbb{R}_+ = [0, \infty)$, by
$$
L(w, b, \lambda) \;:=\; \tfrac{1}{2} \|w\|^2 \;-\; \sum_{i=1}^n \lambda_i \, \{ y_i (w^t x_i - b) - 1 \}
$$

Note: Lagrangian combines objective and constraints into a single function. New variables $\lambda_i$ called Lagrange multipliers.


The Lagrangian and Duality

Key idea: Lagrangian turns primal problem into min-max problem. Note that
$$
\max_{\lambda \ge 0} L(w, b, \lambda) \;=\;
\begin{cases}
\tfrac{1}{2} \|w\|^2 & \text{if constraints satisfied} \\
+\infty & \text{otherwise}
\end{cases}
$$
Thus the primal problem can be written as
$$
p^* \;=\; \min_{w, b} \, \max_{\lambda \ge 0} \, L(w, b, \lambda)
$$
Dual: Changing order of min and max yields the dual problem
$$
d^* \;=\; \max_{\lambda \ge 0} \, \min_{w, b} \, L(w, b, \lambda)
$$


Lagrangian and Duality

Definition: For each $\lambda \ge 0$ define the Lagrange dual function
$$
\tilde{L}(\lambda) \;=\; \min_{w, b} \, L(w, b, \lambda)
$$

Fact:

- The dual function $\tilde{L}(\lambda)$ is concave and has a global maximum
- Thus the dual problem $d^* = \max_{\lambda \ge 0} \tilde{L}(\lambda)$ has a solution
- In general, $d^* \le p^*$. Difference $p^* - d^* \ge 0$ called the duality gap
- For our problem, $d^* = p^*$, that is, the duality gap is zero


Lagrangian Dual Function and Dual Problem

Step 2: Fix $\lambda \ge 0$ and minimize $L(w, b, \lambda)$ over $w, b$. Differentiation gives
$$
w \;=\; \sum_{i=1}^n \lambda_i y_i x_i
\qquad \text{and} \qquad
\sum_{i=1}^n \lambda_i y_i \;=\; 0
$$
Substituting these equations into $L(w, b, \lambda)$ yields the quadratic dual function
$$
\tilde{L}(\lambda) \;=\; \sum_{i=1}^n \lambda_i \;-\; \frac{1}{2} \sum_{i, j = 1}^n \lambda_i \lambda_j \, y_i y_j \, \langle x_i, x_j \rangle
$$

Step 3: Solve the convex dual problem using quadratic programming:
$$
\max_\lambda \; \tilde{L}(\lambda)
\quad \text{s.t.} \quad
\sum_{i=1}^n \lambda_i y_i = 0 \ \text{ and } \ \lambda_1, \ldots, \lambda_n \ge 0
$$


Solving the Problem of Maximizing the Minimum Margin

Step 4: Combine solution $\lambda$ of the dual problem and the optimality conditions to get the desired values of $w$ and $b$:
$$
w \;=\; \sum_{i=1}^n \lambda_i y_i x_i
\qquad \qquad
b \;=\; \frac{1}{2} \left( \min_{i : y_i = 1} x_i^t w \;+\; \max_{i : y_i = -1} x_i^t w \right)
$$

Upshot: Maximum margin classification rule $\phi(x) = \mathrm{sign}(h(x))$ where
$$
h(x) \;=\; x^t w - b \;=\; \sum_{i=1}^n \lambda_i y_i \, \langle x_i, x \rangle \;-\; b
$$

Note: Feature vectors $x_i$ affect rule $\phi$ only through inner products

- Dual $\tilde{L}(\lambda)$ depends on $x_i$'s only through inner products $\langle x_i, x_j \rangle$
- Function $h(x)$ depends on $x_i$'s only through inner products $\langle x_i, x \rangle$


Support Vectors

KKT Conditions: For $i = 1, \ldots, n$ the optimal $w$, $b$, and $\lambda$ are such that
$$
\lambda_i \, [\, y_i h(x_i) - 1 \,] \;=\; 0
\quad \Rightarrow \quad
\lambda_i = 0 \ \text{ or } \ y_i h(x_i) = 1
$$
Let $S = \{ i : \lambda_i > 0 \}$. Note that

- $h(x) = \sum_{i \in S} \lambda_i y_i \langle x_i, x \rangle - b$
- $i \in S \;\Rightarrow\; y_i h(x_i) = 1 \;\Rightarrow\; x_i$ on the max margin hyperplane for class $y_i$

Definition: Training vectors $x_i$ with $i \in S$ called support vectors

- Changing a support vector, with the other data fixed, would change the decision boundary
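Putting Steps 2-4 and the support-vector characterization together, here is a short sketch. It assumes a dual solution $\lambda$ is already available (obtaining it requires a quadratic programming solver, which is not shown), and the function name and tolerance are illustrative choices. It recovers $w$ and $b$ as in Step 4 and evaluates $h(x)$ in the inner-product form restricted to the support set $S$.

```python
import numpy as np

def max_margin_rule(X, y, lam, tol=1e-8):
    """Recover (w, b) and the decision function h from a dual solution lam.

    Assumes X has shape (n, p), y has entries +/-1, and lam (shape (n,))
    solves the dual problem above; how lam is computed (e.g. by a QP solver)
    is not shown here.
    """
    # Step 2 / Step 4: w = sum_i lam_i y_i x_i
    w = (lam * y) @ X

    # Step 4: b is halfway between the closest points of the two classes
    b = 0.5 * (np.min(X[y == 1] @ w) + np.max(X[y == -1] @ w))

    # Support set S = {i : lam_i > 0}; only these points enter h(x)
    S = np.flatnonzero(lam > tol)

    def h(x):
        # h(x) = sum_{i in S} lam_i y_i <x_i, x> - b
        return sum(lam[i] * y[i] * (X[i] @ x) for i in S) - b

    return w, b, S, h

# Usage sketch (lam is a placeholder for an actual dual solution):
#   w, b, S, h = max_margin_rule(X, y, lam)
#   prediction = np.sign(h(x_new))
```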
Extending SVM to Non-Separable Case

Most data sets are not linearly separable: no hyperplane can separate the $\pm 1$'s.

[Figure: two-class data plotted against $X_1$ and $X_2$ that cannot be separated by any hyperplane]

Question: How to extend maximum margin classifiers to this setting?


SVM: Non-Separable Case

Idea: Reformulate the primal problem. For fixed $C > 0$ solve the convex program
$$
\min_{w, b, \xi} \; \left\{ \tfrac{1}{2} \|w\|^2 + C \sum_{i=1}^n \xi_i \right\}
\quad \text{s.t.} \quad
y_i (x_i^t w - b) \ge 1 - \xi_i \ \text{ and } \ \xi_i \ge 0
$$
Here $\xi_1, \ldots, \xi_n$ are slack variables

- $\xi_i$ captures the slack in the hard constraint $y_i (x_i^t w - b) \ge 1$
- $\xi_i$ is the distance of the prediction $x_i^t w - b$ from its target half space, relative to the margin size
- $C \xi_i$ measures the associated cost


SVM: Non-Separable Case, cont.

Minimizing the objective function $\tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i$ entails a tradeoff between

- making $\|w\|^2$ small, enlarging the margin around the separating hyperplane
- making $\sum_{i=1}^n \xi_i$ small, minimizing margin violations relative to the margin size
- the parameter $C$ controls the tradeoff


Slack Variables and Margins

Consider the linear function $h(x) = x^t w - b$ and the associated rule $\phi(x) = \mathrm{sign}(h(x))$

- Separating hyperplane $H = \{x : h(x) = 0\}$
- Target half spaces $H^+ = \{x : h(x) \ge 1\}$ and $H^- = \{x : h(x) \le -1\}$

Consider a data point $(x_i, y_i)$ with $m_i = y_i h(x_i)$. Three cases:

1. If $m_i \ge 1$ then $\phi(x_i) = y_i$ and $x_i \in H^{y_i}$, slack $\xi_i = 0$
2. If $0 \le m_i < 1$ then $\phi(x_i) = y_i$ but $x_i \notin H^{y_i}$, slack $\xi_i = 1 - m_i \in (0, 1]$
3. If $m_i < 0$ then $\phi(x_i) \ne y_i$ and $x_i \notin H^{y_i}$, slack $\xi_i = 1 - m_i > 1$


Finding Soft Margin Classifier

Solve the primal problem using the same Lagrangian-dual method as in the separable case.

The Lagrangian of the primal problem is
$$
L(w, b, \xi, \lambda, \gamma) \;:=\; \tfrac{1}{2} \|w\|^2 + C \sum_{i=1}^n \xi_i
\;-\; \sum_{i=1}^n \lambda_i \, \{ y_i (w^t x_i - b) - 1 + \xi_i \}
\;-\; \sum_{i=1}^n \gamma_i \xi_i
$$
The dual problem is given by
$$
\max_\lambda \; \left\{ \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i, j = 1}^n \lambda_i \lambda_j \, y_i y_j \, \langle x_i, x_j \rangle \right\}
\quad \text{subject to} \quad
0 \le \lambda_i \le C \ \text{ and } \ \sum_{i=1}^n \lambda_i y_i = 0
$$


Soft Margin Classifier

Finally: Combine the solution $\lambda$ of the dual problem and the KKT optimality conditions to obtain the support set $S = \{i : \lambda_i > 0\}$ and the optimal $w, b$:
$$
w \;=\; \sum_{i \in S} \lambda_i y_i x_i
\qquad \qquad
b \;=\; \text{function of } \lambda \text{ and the data}
$$

Upshot: Optimal soft margin classification rule $\phi(x) = \mathrm{sign}(h(x))$ where
$$
h(x) \;=\; x^t w - b \;=\; \sum_{i \in S} \lambda_i y_i \, \langle x_i, x \rangle \;-\; b
$$
Again: Rule $\phi$ depends on the feature vectors $x_i, x$ only through inner products

- $\tilde{L}(\lambda)$ depends on the $x_i$'s only through $\langle x_i, x_j \rangle$
- $h(x)$ depends on the $x_i$'s only through $\langle x_i, x \rangle$


Effect of Parameter C

[Figure: four panels showing the SVM fit on the same non-separable data, from small C (top left) to large C (bottom right); axes $X_1$ and $X_2$. From ISL]


Nonlinear SVM: Background

Note: The inner product $\langle x, x' \rangle$ is a signed measure of similarity between $x$ and $x'$

- $\langle x, x' \rangle = \|x\| \, \|x'\|$ if $x, x'$ point in the same direction
- $\langle x, x' \rangle = 0$ if $x, x'$ are orthogonal
- $\langle x, x' \rangle = -\|x\| \, \|x'\|$ if $x, x'$ point in opposite directions

Goal: Enhance and expand the applicability of standard SVM

- Map predictors $x$ to a new feature space via a nonlinear transformation
- Classify data using similarity between transformed features
- In many cases the new feature space is high dimensional


Direct Approach to Nonlinear SVM: Feature Maps

Given: Data $(x_1, y_1), \ldots, (x_n, y_n) \in \mathcal{X} \times \{\pm 1\}$

- Define a feature map $\gamma : \mathcal{X} \to \mathbb{R}^d$ taking predictors to high dimensional features
- Apply SVM to the observations $(\gamma(x_1), y_1), \ldots, (\gamma(x_n), y_n)$
- The SVM classifier is the sign of $h(x) = \sum_{i=1}^n \lambda_i y_i \langle \gamma(x_i), \gamma(x) \rangle - b$

Example 1: Two-way interactions (polynomials of degree two)

- Predictor space $\mathcal{X} = \mathbb{R}^p$
- Define the feature map $\gamma : \mathcal{X} \to \mathbb{R}^d$ by $\gamma(x) = (x_i x_j)_{1 \le i, j \le p}$
- Computing $\langle \gamma(x), \gamma(x') \rangle$ directly requires $d = p^2$ operations
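To make Example 1 concrete, here is a small numpy sketch of the degree-two feature map $\gamma(x) = (x_i x_j)_{1 \le i, j \le p}$. It also checks the standard identity $\langle \gamma(x), \gamma(x') \rangle = \langle x, x' \rangle^2$ (a known fact, though not stated explicitly above), which means the $p^2$-dimensional inner product can be computed in $O(p)$ operations without ever forming $\gamma(x)$; this is the observation that kernels exploit.

```python
import numpy as np

def gamma(x):
    """Degree-two feature map: all products x_i * x_j, flattened into a vector in R^(p^2)."""
    return np.outer(x, x).ravel()

rng = np.random.default_rng(0)
p = 5
x  = rng.normal(size=p)
xp = rng.normal(size=p)

lhs = gamma(x) @ gamma(xp)   # inner product in the p^2-dimensional feature space
rhs = (x @ xp) ** 2          # <x, x'>^2, computed with O(p) operations

print(lhs, rhs)              # agree up to floating point error
assert np.isclose(lhs, rhs)
```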
