Support Vector Machines
Andrew Nobel, April 2020

Support Vector Machines (SVM)

- Simplest case: linear classification rule
- Generalizes to non-linear rules through feature maps and kernels
- Good off-the-shelf method for high dimensional data, widely used
- Begins with geometry rather than a statistical model
- Close connections with convex programming
- Early bridge between machine learning and optimization

Linear Classification Rules

Setting: Labeled pairs $(x, y)$ with predictor $x \in \mathbb{R}^p$ and class $y \in \{+1, -1\}$.

Recall: Given $w \in \mathbb{R}^p$ and $b \in \mathbb{R}$, the linear classification rule has the form

$$ \phi(x) = \mathrm{sign}(x^t w - b) = \begin{cases} +1 & \text{if } w^t x \ge b \\ -1 & \text{if } w^t x < b \end{cases} $$

The decision boundary of $\phi$ is the hyperplane $H = \{x : w^t x = b\}$.

Idea: Given a pair $(x, y)$, ask two questions

- Correctness: Is $\phi(x) = y$?
- Confidence: How far is $x$ from the decision boundary $H$?

Distance to Decision Boundary

Fact: For every $x$, the ratio $(x^t w - b)/\|w\|$ is the signed distance from $x$ to $H$.

Margin

Definition: The margin of the rule $\phi$ at a pair $(x, y)$ is given by

$$ m_\phi(x, y) = y \, \frac{x^t w - b}{\|w\|} $$

Idea: The margin measures the overall fit of $\phi$ at $(x, y)$

- $m_\phi(x, y) > 0$ iff $\phi(x) = y$ iff $x$ is on the correct side of $H$
- $m_\phi(x, y) < 0$ iff $\phi(x) \ne y$ iff $x$ is on the wrong side of $H$
- $|m_\phi(x, y)|$ = distance from $x$ to $H$ (from the Fact above)

Max Margin Classifiers and Linear Separability

Goal: In fitting a linear rule to data, we would like the margins of all the data points to be large and positive, if possible. Two cases:

1. Data is linearly separable $\Rightarrow$ max margin classifier
2. Data is not linearly separable $\Rightarrow$ soft margin classifier

Definition: $D_n = (x_1, y_1), \ldots, (x_n, y_n)$ is linearly separable if there is a hyperplane $H$ separating $\{x_i : y_i = 1\}$ and $\{x_i : y_i = -1\}$.

Linearly Separable Data: Multiple Hyperplanes / Max Margin Classifier

[Figure: linearly separable data in the $(X_1, X_2)$ plane, shown with several separating hyperplanes and with the max margin classifier. From ISL.]

Maximizing the Minimum Margin

Max Margin Classifier: If $D_n$ is linearly separable, find $w$ and $b$ to maximize the minimum margin of $\phi(x) = \mathrm{sign}(x^t w - b)$:

$$ \max_{w, b} \Gamma(w, b) \quad \text{where} \quad \Gamma(w, b) = \min_{1 \le i \le n} y_i \, \frac{x_i^t w - b}{\|w\|} $$

Fact: The solution of $\max_{w,b} \Gamma(w, b)$ is the solution of the convex program

$$ p^* = \min_{w, b} \frac{1}{2} \|w\|^2 \quad \text{subject to } y_i (x_i^t w - b) \ge 1 \text{ for } i = 1, \ldots, n $$

Finding $p^*$ is called the primal problem.

Solving the Problem of Maximizing the Minimum Margin

Approach: Solve the primal problem using techniques of convex optimization: the Lagrangian function and duality.

Step 1: Define the Lagrangian $L : \mathbb{R}^p \times \mathbb{R} \times \mathbb{R}_+^n \to \mathbb{R}$, with $\mathbb{R}_+ = [0, \infty)$, by

$$ L(w, b, \lambda) := \frac{1}{2} \|w\|^2 - \sum_{i=1}^n \lambda_i \{ y_i (w^t x_i - b) - 1 \} $$

Note: The Lagrangian combines the objective and the constraints into a single function. The new variables $\lambda_i$ are called Lagrange multipliers.

The Lagrangian and Duality

Key idea: The Lagrangian turns the primal problem into a min-max problem. Note that

$$ \max_{\lambda \ge 0} L(w, b, \lambda) = \begin{cases} \frac{1}{2}\|w\|^2 & \text{if the constraints are satisfied} \\ +\infty & \text{otherwise} \end{cases} $$

Thus the primal problem can be written as

$$ p^* = \min_{w, b} \max_{\lambda \ge 0} L(w, b, \lambda) $$

Dual: Changing the order of min and max yields the dual problem

$$ d^* = \max_{\lambda \ge 0} \min_{w, b} L(w, b, \lambda) $$

Lagrangian and Duality

Definition: For each $\lambda \ge 0$ define the Lagrange dual function

$$ \tilde{L}(\lambda) = \min_{w, b} L(w, b, \lambda) $$

Fact:

- The dual function $\tilde{L}(\lambda)$ is concave and has a global maximum
- Thus the dual problem $d^* = \max_{\lambda \ge 0} \tilde{L}(\lambda)$ has a solution
- In general, $d^* \le p^*$. The difference $p^* - d^* \ge 0$ is called the duality gap
- For our problem, $d^* = p^*$, that is, the duality gap is zero
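The primal problem above is a small quadratic program, so it can be handed to a generic convex solver. Below is a minimal sketch, assuming the cvxpy package and made-up separable data (the data values and variable names are illustrative, not from the lecture); the dual values reported for the margin constraints play the role of the Lagrange multipliers $\lambda_i$.

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data (illustrative values only)
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],    # class +1
              [0.0, 0.0], [1.0, 0.5], [0.5, 1.5]])   # class -1
y = np.array([1, 1, 1, -1, -1, -1])
n, p = X.shape

w = cp.Variable(p)
b = cp.Variable()

# Primal: minimize (1/2)||w||^2 subject to y_i (x_i^t w - b) >= 1
constraints = [cp.multiply(y, X @ w - b) >= 1]
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
prob.solve()

w_hat, b_hat = w.value, b.value
margins = y * (X @ w_hat - b_hat) / np.linalg.norm(w_hat)
print("w =", w_hat, " b =", b_hat)
print("margins:", margins)                    # every margin is at least 1/||w||
print("lambda =", constraints[0].dual_value)  # multipliers for each constraint
```

At the solution the minimum margin equals $1/\|w\|$, and the points attaining it are the ones with nonzero multipliers, anticipating the support vectors discussed below.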
Lagrangian Dual Function and Dual Problem

Step 2: Fix $\lambda \ge 0$ and minimize $L(w, b, \lambda)$ over $w, b$. Differentiation gives

$$ w = \sum_{i=1}^n \lambda_i y_i x_i \quad \text{and} \quad \sum_{i=1}^n \lambda_i y_i = 0 $$

Substituting these equations into $L(w, b, \lambda)$ yields the quadratic dual function

$$ \tilde{L}(\lambda) = \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i,j=1}^n \lambda_i \lambda_j y_i y_j \langle x_i, x_j \rangle $$

Step 3: Solve the convex dual problem using quadratic programming

$$ \max_\lambda \tilde{L}(\lambda) \quad \text{s.t. } \sum_{i=1}^n \lambda_i y_i = 0 \text{ and } \lambda_1, \ldots, \lambda_n \ge 0 $$

Solving the Problem of Maximizing the Minimum Margin, cont.

Step 4: Combine the solution $\lambda$ of the dual problem and the optimality conditions to get the desired values of $w$ and $b$:

$$ w = \sum_{i=1}^n \lambda_i y_i x_i \qquad b = \frac{1}{2} \left( \min_{i : y_i = 1} x_i^t w + \max_{i : y_i = -1} x_i^t w \right) $$

Upshot: The maximum margin classification rule is $\phi(x) = \mathrm{sign}(h(x))$ where

$$ h(x) = x^t w - b = \sum_{i=1}^n \lambda_i y_i \langle x_i, x \rangle - b $$

Note: The feature vectors $x_i$ affect the rule $\phi$ only through inner products

- The dual $\tilde{L}(\lambda)$ depends on the $x_i$'s only through the inner products $\langle x_i, x_j \rangle$
- The function $h(x)$ depends on the $x_i$'s only through the inner products $\langle x_i, x \rangle$

Support Vectors

KKT Conditions: For $i = 1, \ldots, n$ the optimal $w$, $b$, and $\lambda$ are such that

$$ \lambda_i [\, y_i h(x_i) - 1 \,] = 0 \ \Rightarrow \ \lambda_i = 0 \ \text{or} \ y_i h(x_i) = 1 $$

Let $S = \{i : \lambda_i > 0\}$. Note that

- $h(x) = \sum_{i \in S} \lambda_i y_i \langle x_i, x \rangle - b$
- $i \in S \Rightarrow y_i h(x_i) = 1 \Rightarrow x_i$ lies on the max margin hyperplane for class $y_i$

Definition: Training vectors $x_i$ with $i \in S$ are called support vectors.

- Changing a support vector with the other data fixed would change the decision boundary

Extending SVM to the Non-Separable Case

Most data sets are not linearly separable: no hyperplane can separate the $\pm 1$'s.

[Figure: non-separable data in the $(X_1, X_2)$ plane; the two classes overlap.]

Question: How to extend maximum margin classifiers to this setting?

SVM: Non-Separable Case

Idea: Reformulate the primal problem. For fixed $C > 0$ solve the convex program

$$ \min_{w, b, \xi} \left\{ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^n \xi_i \right\} \quad \text{s.t. } y_i (x_i^t w - b) \ge 1 - \xi_i \text{ and } \xi_i \ge 0 $$

Here $\xi_1, \ldots, \xi_n$ are slack variables

- $\xi_i$ captures the slack in the hard constraint $y_i (x_i^t w - b) \ge 1$
- $\xi_i$ is the distance of the prediction $x_i^t w - b$ from the optimal halfspace, relative to the margin size
- $C \xi_i$ measures the associated cost

SVM: Non-Separable Case, cont.

Minimizing the objective function $\frac{1}{2} \|w\|^2 + C \sum_{i=1}^n \xi_i$ entails a tradeoff between

- making $\|w\|^2$ small, enlarging the margin around the separating hyperplane
- making $\sum_{i=1}^n \xi_i$ small, minimizing margin violations relative to the margin size
- the parameter $C$ controls the tradeoff (see the sketch below)
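As a quick illustration of the soft-margin rule and its dual representation, here is a minimal sketch using scikit-learn's SVC with a linear kernel on made-up overlapping data (data and variable names are illustrative, not from the lecture). SVC fits the same soft-margin dual; note that it writes the rule as $\mathrm{sign}(x^t w + b)$, so its `intercept_` corresponds to $-b$ in the notation above.

```python
import numpy as np
from sklearn.svm import SVC

# Toy overlapping (non-separable) data: illustrative values only
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=1.0, size=(20, 2)),
               rng.normal(loc=-1.0, size=(20, 2))])
y = np.array([1] * 20 + [-1] * 20)

# Soft-margin SVM with linear kernel; C controls the margin/violation tradeoff
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ stores lambda_i * y_i for i in S (the support vectors), so the
# expansion w = sum_{i in S} lambda_i y_i x_i can be checked directly:
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))            # True
print("support vectors:", clf.support_vectors_.shape[0], "of", len(y))
print("multipliers in [0, C]:", np.all(np.abs(clf.dual_coef_) <= 1.0 + 1e-8))
```

Refitting with a smaller or larger `C` changes the margin width and the number of margin violations, which is the tradeoff pictured in the Effect of Parameter C figure below.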
Slack Variables and Margins

Consider the linear function $h(x) = x^t w - b$ and the associated rule $\phi(x) = \mathrm{sign}(h(x))$

- Separating hyperplane $H = \{x : h(x) = 0\}$
- Target half spaces $H^+ = \{x : h(x) \ge 1\}$ and $H^- = \{x : h(x) \le -1\}$

Consider a data point $(x_i, y_i)$ with $m_i = y_i h(x_i)$. Three cases:

1. If $m_i \ge 1$ then $\phi(x_i) = y_i$ and $x_i \in H^{y_i}$, slack $\xi_i = 0$
2. If $0 \le m_i < 1$ then $\phi(x_i) = y_i$ but $x_i \notin H^{y_i}$, slack $\xi_i = 1 - m_i \in (0, 1]$
3. If $m_i < 0$ then $\phi(x_i) \ne y_i$ and $x_i \notin H^{y_i}$, slack $\xi_i = 1 - m_i > 1$

Finding the Soft Margin Classifier

Solve the primal problem using the same Lagrangian-dual method as in the separable case. The Lagrangian of the primal problem is

$$ L(w, b, \xi, \lambda, \gamma) := \frac{1}{2} \|w\|^2 + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \lambda_i \{ y_i (w^t x_i - b) - 1 + \xi_i \} - \sum_{i=1}^n \gamma_i \xi_i $$

The dual problem is given by

$$ \max_\lambda \left\{ \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i,j=1}^n \lambda_i \lambda_j y_i y_j \langle x_i, x_j \rangle \right\} \quad \text{subject to } 0 \le \lambda_i \le C \text{ and } \sum_{i=1}^n \lambda_i y_i = 0 $$

Soft Margin Classifier

Finally: Combine the solution $\lambda$ of the dual problem and the KKT optimality conditions to obtain the support set $S = \{i : \lambda_i > 0\}$ and the optimal $w, b$:

$$ w = \sum_{i \in S} \lambda_i y_i x_i \qquad b = \text{a function of } \lambda \text{ and the data} $$

Upshot: The optimal soft margin classification rule is $\phi(x) = \mathrm{sign}(h(x))$ where

$$ h(x) = x^t w - b = \sum_{i \in S} \lambda_i y_i \langle x_i, x \rangle - b $$

Again: The rule $\phi$ depends on the feature vectors $x_i, x$ only through inner products

- $\tilde{L}(\lambda)$ depends on the $x_i$'s only through $\langle x_i, x_j \rangle$
- $h(x)$ depends on the $x_i$'s only through $\langle x_i, x \rangle$

Effect of Parameter C

[Figure: four panels in the $(X_1, X_2)$ plane showing SVM fits with small C (top left) to large C (bottom right); data non-separable. From ISL.]

Nonlinear SVM: Background

Note: The inner product $\langle x, x' \rangle$ is a signed measure of similarity between $x$ and $x'$

- $\langle x, x' \rangle = \|x\| \, \|x'\|$ if $x, x'$ point in the same direction
- $\langle x, x' \rangle = 0$ if $x, x'$ are orthogonal
- $\langle x, x' \rangle = -\|x\| \, \|x'\|$ if $x, x'$ point in opposite directions

Goal: Enhance and expand the applicability of standard SVM

- Map the predictors $x$ to a new feature space via a nonlinear transformation
- Classify data using similarity between the transformed features
- In many cases the new feature space is high dimensional

Direct Approach to Nonlinear SVM: Feature Maps

Given: Data $(x_1, y_1), \ldots, (x_n, y_n) \in \mathcal{X} \times \{\pm 1\}$

- Define a feature map $\gamma : \mathcal{X} \to \mathbb{R}^d$ taking predictors to high dimensional features
- Apply SVM to the observations $(\gamma(x_1), y_1), \ldots, (\gamma(x_n), y_n)$
- The SVM classifier is the sign of $h(x) = \sum_{i=1}^n \lambda_i y_i \langle \gamma(x_i), \gamma(x) \rangle - b$

Example 1: Two-way interactions (polynomials of degree two)

- Predictor space $\mathcal{X} = \mathbb{R}^p$
- Define the feature map $\gamma : \mathcal{X} \to \mathbb{R}^d$ by $\gamma(x) = (x_i x_j)_{1 \le i, j \le p}$
- Computing $\langle \gamma(x), \gamma(x') \rangle$ directly requires $d = p^2$ operations (see the sketch below)
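To make the cost remark concrete, here is a minimal numpy sketch (illustrative code, not from the lecture) of the degree-two feature map. It also checks the identity $\langle \gamma(x), \gamma(x') \rangle = \langle x, x' \rangle^2$, which lets the $p^2$-dimensional inner product be computed with roughly $p$ operations and is the observation behind the kernels mentioned at the start of the lecture.

```python
import numpy as np

def gamma(x):
    """Degree-two feature map: all products x_i * x_j, flattened to a p^2 vector."""
    return np.outer(x, x).ravel()

rng = np.random.default_rng(1)
x, x_prime = rng.normal(size=5), rng.normal(size=5)

direct = gamma(x) @ gamma(x_prime)      # inner product in R^{p^2}: ~p^2 operations
via_kernel = (x @ x_prime) ** 2         # same value from <x, x'>^2: ~p operations

print(np.isclose(direct, via_kernel))   # True
```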