Machine Learning - Brett Bernstein
Recitation 3: Geometric Derivation of SVMs

Intro Question

1. You have been given a data set $(x_i, y_i)$ for $i = 1, \ldots, n$ where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$. Assume $w \in \mathbb{R}^d$ and $a \in \mathbb{R}$.

   (a) Suppose $y_i(w^T x_i + a) > 0$ for all $i$. Use a picture to explain what this means when $d = 2$.

   (b) Fix $M > 0$. Suppose $y_i(w^T x_i + a) \geq M$ for all $i$. Use a picture to explain what this means when $d = 2$.

Figure 1: Data set with $x_i \in \mathbb{R}^2$ and $y_i \in \{+1, -1\}$

Solution.

   (a) The data is linearly separable.

   (b) The data is separable with geometric margin at least $M / \|w\|_2$.

Both of these answers will be fleshed out in the upcoming sections.

Support Vector Machines

Review of Geometry

If $v, w \in \mathbb{R}^d$ then the component (also called scalar projection) of $v$ in the direction $w$ is given by the scalar $\frac{w^T v}{\|w\|_2}$. This can also be thought of as the signed length of $v$ when orthogonally projected onto the line through the vector $w$.

Figure 2: Components $\frac{w^T v_1}{\|w\|_2}, \frac{w^T v_2}{\|w\|_2}$ of $v_1, v_2$ in the direction $w$

Assuming $w \neq 0$ we can use this to interpret the set
$$S = \{x \in \mathbb{R}^d \mid w^T x = b\}.$$
Noting that $w^T x = b \iff \frac{w^T x}{\|w\|_2} = \frac{b}{\|w\|_2}$, we see that $S$ contains all vectors whose component in the direction $w$ is $\frac{b}{\|w\|_2}$. Using linear algebra we can see this is the hyperplane orthogonal to the vector $w$ that passes through the point $\frac{bw}{\|w\|_2^2}$. Note also that there are infinitely many pairs $(w, b)$ that give the same hyperplane. If $c \neq 0$ then
$$\{x \in \mathbb{R}^d \mid w^T x = b\} \quad \text{and} \quad \{x \in \mathbb{R}^d \mid (cw)^T x = cb\}$$
result in the same hyperplanes.

Figure 3: Level surfaces of $f(v) = w^T v$ with $\|w\|_2 = 1$

Given a hyperplane $\{v \mid w^T v = b\}$, we can distinguish points $x \in \mathbb{R}^d$ depending on whether $w^T x - b$ is zero, positive, or negative, or in other words, whether $x$ is on the hyperplane, on the side $w$ is pointing at, or on the side $-w$ is pointing at.

Figure 4: Sides of the hyperplane $w^T v = 15$

If we have a vector $x \in \mathbb{R}^d$ and a hyperplane $H = \{v \mid w^T v = b\}$ we can measure the distance from $x$ to $H$ by
$$d(x, H) = \frac{|w^T x - b|}{\|w\|_2}.$$
Without the absolute values we get the signed distance: a positive distance if $w^T x > b$ and a negative distance if $w^T x < b$. To see why this formula is correct, note that we are computing
$$\frac{w^T x}{\|w\|_2} - \frac{w^T v}{\|w\|_2},$$
where $v$ is any vector in the hyperplane $\{v \mid w^T v = b\}$. This is the difference between their components in the direction $w$.

Figure 5: Signed distances from $x_1, x_2$ to the hyperplane $w^T v = 20$, where $\|w\|_2 = \sqrt{10}$, $\frac{w^T x_1 - 20}{\|w\|_2} = -\frac{8}{\sqrt{10}}$, and $\frac{w^T x_2 - 20}{\|w\|_2} = \frac{7}{\sqrt{10}}$

Hard Margin SVM

Returning to the initial question, suppose we have the data set $(x_i, y_i)$ for $i = 1, \ldots, n$ where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$.

Definition 1 (Linearly Separable). We say $(x_i, y_i)$ for $i = 1, \ldots, n$ are linearly separable if there is a $w \in \mathbb{R}^d$ and $a \in \mathbb{R}$ such that $y_i(w^T x_i + a) > 0$ for all $i$. The set $\{v \in \mathbb{R}^d \mid w^T v + a = 0\}$ is called a separating hyperplane.

Let's examine what this definition says. If $y_i = +1$ then we require that $w^T x_i > -a$ and if $y_i = -1$ we require that $w^T x_i < -a$. Thus linearly separable means that we can separate all of the $+1$'s from the $-1$'s using the hyperplane $\{v \mid w^T v = -a\}$. For the rest of this section, we assume our data is linearly separable.
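To make the signed distance formula and the separability condition in Definition 1 concrete, here is a minimal NumPy sketch; the helper names, the toy points, and the particular choice of $w, a$ are invented for illustration and are not part of the original notes.

```python
import numpy as np

def signed_distance(X, w, b):
    # Signed distance (w^T x - b) / ||w||_2 from each row of X to the
    # hyperplane {v : w^T v = b}; positive on the side w points toward.
    return (X @ w - b) / np.linalg.norm(w)

def geometric_margin(X, y, w, a):
    # min_i y_i (w^T x_i + a) / ||w||_2 for the hyperplane {v : w^T v + a = 0}.
    # This is positive exactly when (w, a) separates the labeled data.
    return np.min(y * (X @ w + a)) / np.linalg.norm(w)

# Toy 2D data in the spirit of Figure 1 (values invented for illustration).
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, a = np.array([1.0, 1.0]), 0.0

print(signed_distance(X, w, -a))     # hyperplane w^T v + a = 0 means b = -a
print(geometric_margin(X, y, w, a))  # > 0, so this (w, a) separates the toy data
```

In this notation, part (b) of the intro question says exactly that `geometric_margin(X, y, w, a)` is at least $M / \|w\|_2$.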
Figure 6: Linearly Separable Data

If we can find the $w, a$ corresponding to a hyperplane that separates the data, we then have a decision function for classifying elements of $\mathcal{X}$: $f(x) = \operatorname{sgn}(w^T x + a)$. Before we look for such a hyperplane, we must address another issue. If the data is linearly separable, then there are infinitely many choices of separating hyperplanes.

Figure 7: Many Separating Hyperplanes Exist

We will choose the hyperplane that maximizes a quantity called the geometric margin.

Definition 2 (Geometric Margin). Let $H$ be a hyperplane that separates the data $(x_i, y_i)$ for $i = 1, \ldots, n$. The geometric margin of this hyperplane is
$$\min_i d(x_i, H),$$
the distance from the hyperplane to the closest data point.

Fix $w \in \mathbb{R}^d$ and $a \in \mathbb{R}$ with $y_i(w^T x_i + a) > 0$ for all $i$. Then we saw earlier that
$$d(x_i, H) = \frac{|w^T x_i + a|}{\|w\|_2} = \frac{y_i(w^T x_i + a)}{\|w\|_2}.$$
This gives us the following optimization problem:
$$\underset{w, a}{\text{maximize}} \quad \min_i \frac{y_i(w^T x_i + a)}{\|w\|_2}.$$
We can rewrite this in a more standard form:
$$\begin{aligned}
\underset{w, a, M}{\text{maximize}} \quad & M \\
\text{subject to} \quad & \frac{y_i(w^T x_i + a)}{\|w\|_2} \geq M \quad \text{for all } i.
\end{aligned}$$

Figure 8: Maximum Margin Separating Hyperplane, with level sets $\frac{w^T v + a}{\|w\|_2} = M, 0, -M$

Note above how the geometric margin is achieved on both sides of the optimal hyperplane. This must be the case, as otherwise we could slightly move the hyperplane and obtain a better solution. The expression $y_i(w^T x_i + a) / \|w\|_2$ allows us to choose any positive value for $\|w\|_2$ by changing $a$ accordingly (e.g., we can replace $w \to 2w$ and $a \to 2a$ and get the same value for all $(x_i, y_i)$). Thus we can fix $\|w\|_2 = 1/M$ and obtain
$$\begin{aligned}
\underset{w, a}{\text{maximize}} \quad & 1 / \|w\|_2 \\
\text{subject to} \quad & y_i(w^T x_i + a) \geq 1 \quad \text{for all } i.
\end{aligned}$$
To find the optimal $w, a$ we can instead solve the minimization problem
$$\begin{aligned}
\underset{w, a}{\text{minimize}} \quad & \|w\|_2^2 \\
\text{subject to} \quad & y_i(w^T x_i + a) \geq 1 \quad \text{for all } i.
\end{aligned}$$
This is a quadratic program that can be solved by standard packages.

Here the geometric margin is $\frac{1}{\|w\|_2}$, which is also the minimum of $\frac{y_i(w^T x_i + a)}{\|w\|_2}$ over the training data. The concept of margin used in class (also called functional margin) is slightly different, but clearly related. It is the value $y_i(w^T x_i + a)$ denoting the score we give to a given training example.

Soft Margin SVM

The methods developed thus far require linearly separable data. To remove this restriction, we will allow vectors to violate the geometric margin requirements, but at a penalty. More precisely, we replace our previous SVM formulation
$$\begin{aligned}
\underset{w, a}{\text{minimize}} \quad & \|w\|_2^2 \\
\text{subject to} \quad & y_i(w^T x_i + a) \geq 1 \quad \text{for all } i
\end{aligned}$$
with
$$\begin{aligned}
\underset{w, a, \xi}{\text{minimize}} \quad & \|w\|_2^2 + \frac{C}{n} \sum_{i=1}^n \xi_i \\
\text{subject to} \quad & y_i(w^T x_i + a) \geq 1 - \xi_i \quad \text{for all } i \\
& \xi_i \geq 0 \quad \text{for all } i.
\end{aligned}$$
This is the standard formulation of a support vector machine. When $\xi_i > 0$ the corresponding $x_i$ violates the geometric margin condition. Each $\xi_i$ is called a slack variable. The constant $C$ controls how much we penalize violations. Rewriting the condition as
$$\frac{y_i(w^T x_i + a)}{\|w\|_2} \geq \frac{1 - \xi_i}{\|w\|_2}$$
shows that $\xi_i$ measures the size of the violation in multiples of the geometric margin. For example, $\xi_i = 1$ means $x_i$ lies on the decision hyperplane $w^T v + a = 0$, and $\xi_i = 3$ means $x_i$ lies 2 margin widths past the decision hyperplane.

Figure 9: Soft Margin SVM (unlabeled points have $\xi_i = 0$; the labeled violations have $\xi_i = 1.5, 1.5, 2, 3$)
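For completeness, here is a small sketch of the soft margin quadratic program above using cvxpy, assuming it is installed; the function name and the toy data are invented, and this is only an illustration of the formulation, not a replacement for a dedicated solver such as scikit-learn's `SVC(kernel='linear')` (which parameterizes the penalty differently from the $C/n$ used here).

```python
import numpy as np
import cvxpy as cp

def soft_margin_svm(X, y, C=1.0):
    # Directly solve:  minimize    ||w||_2^2 + (C/n) * sum_i xi_i
    #                  subject to  y_i (w^T x_i + a) >= 1 - xi_i,  xi_i >= 0.
    n, d = X.shape
    w, a, xi = cp.Variable(d), cp.Variable(), cp.Variable(n)
    objective = cp.Minimize(cp.sum_squares(w) + (C / n) * cp.sum(xi))
    constraints = [cp.multiply(y, X @ w + a) >= 1 - xi, xi >= 0]
    cp.Problem(objective, constraints).solve()
    return w.value, a.value, xi.value

# Invented, non-separable toy data: one point of each class sits inside the
# opposite cluster, so some slack variables must come out positive.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0],
              [1.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
w, a, xi = soft_margin_svm(X, y, C=10.0)
print(np.round(w, 3), round(float(a), 3), np.round(xi, 3))
```

Points with $\xi_i > 0$ in the returned slack vector are exactly the margin violations pictured in Figure 9.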
Recall from the treatment in class that the minimizer $w$ will be a linear combination of some of the $x_i$, called support vectors. More precisely, the support vectors will be some subset of the $x_i$ that either lie on the margin boundary ($y_i(w^T x_i + a) = 1$) or violate the margin boundary ($y_i(w^T x_i + a) < 1$, $\xi_i > 0$).

Regularization Interpretation

Consider the following two questions:

1. If your data is linearly separable, which SVM (hard margin or soft margin) would you use?

Solution. While a hard margin SVM will work, it still often makes sense to use the soft margin SVM to avoid overfitting. We will discuss this below.

2. Explain geometrically what the following optimization problem computes:
$$\begin{aligned}
\underset{w, a, \xi}{\text{minimize}} \quad & \frac{1}{n} \sum_{i=1}^n \xi_i \\
\text{subject to} \quad & y_i(w^T x_i + a) \geq 1 - \xi_i \quad \text{for all } i \\
& \|w\|_2^2 \leq r^2 \\
& \xi_i \geq 0 \quad \text{for all } i.
\end{aligned}$$
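If you want to experiment with the problem in question 2 numerically before working out its geometric meaning, it can be coded up in the same style as the soft margin sketch above; the function name, the choice of $r$, and the toy data below are all invented for illustration.

```python
import numpy as np
import cvxpy as cp

def norm_constrained_svm(X, y, r):
    # The problem from question 2:  minimize    (1/n) * sum_i xi_i
    #                               subject to  y_i (w^T x_i + a) >= 1 - xi_i,
    #                                           ||w||_2^2 <= r^2,  xi_i >= 0.
    n, d = X.shape
    w, a, xi = cp.Variable(d), cp.Variable(), cp.Variable(n)
    objective = cp.Minimize(cp.sum(xi) / n)
    constraints = [
        cp.multiply(y, X @ w + a) >= 1 - xi,
        cp.sum_squares(w) <= r ** 2,
        xi >= 0,
    ]
    cp.Problem(objective, constraints).solve()
    return w.value, a.value, xi.value

# Same invented toy data as before; try a few values of r and compare the
# resulting w, a, xi with the soft margin solution.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0],
              [1.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
print(norm_constrained_svm(X, y, r=2.0))
```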