CIS 520: Machine Learning Spring 2019: Lecture 5
Support Vector Machines for Classification and Regression
Lecturer: Shivani Agarwal
Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the material discussed in the lecture (and vice versa).
Outline
• Linearly separable data: Hard margin SVMs
• Non-linearly separable data: Soft margin SVMs
• Loss minimization view
• Support vector regression (SVR)
1 Linearly Separable Data: Hard Margin SVMs
In this lecture we consider linear support vector machines (SVMs); we will consider nonlinear extensions in the next lecture. Let $\mathcal{X} = \mathbb{R}^d$, and consider a binary classification task with $\mathcal{Y} = \hat{\mathcal{Y}} = \{\pm 1\}$. A training sample $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathbb{R}^d \times \{\pm 1\})^m$ is said to be linearly separable if there exists a linear classifier $h_{w,b}(x) = \mathrm{sign}(w^\top x + b)$ which classifies all examples in $S$ correctly, i.e. for which $y_i(w^\top x_i + b) > 0 \;\; \forall i \in [m]$. For example, Figure 1 (left) shows a training sample in $\mathbb{R}^2$ that is linearly separable, together with two possible linear classifiers that separate the data correctly (note that the decision surface of a linear classifier in 2 dimensions is a line, and more generally in $d > 2$ dimensions is a hyperplane). Which of the two classifiers is likely to give better generalization performance?
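The separability condition above is easy to check numerically. The following sketch tests whether a given $(w, b)$ separates a sample; the toy data in $\mathbb{R}^2$ is made up for illustration and is not the sample shown in Figure 1:

```python
import numpy as np

def separates(w, b, X, y):
    """Check the linear separability condition for a fixed classifier:
    y_i (w^T x_i + b) > 0 for all i."""
    margins = y * (X @ w + b)
    return bool(np.all(margins > 0))

# Illustrative toy sample in R^2 (two positive, two negative examples)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

print(separates(np.array([1.0, 1.0]), 0.0, X, y))   # this (w, b) separates S
print(separates(np.array([-1.0, 1.0]), 0.0, X, y))  # this one does not
```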
Figure 1: Left: A linearly separable data set, with two possible linear classifiers that separate the data. Blue circles represent class label 1 and red crosses −1; the arrow represents the direction of positive classification. Right: The same data set and classifiers, with margin of separation shown.
Although both classifiers separate the data, the distance or margin with which the separation is achieved is different; this is shown in Figure 1 (right). For the rest of this section, assume that the training sample S = ((x1, y1),..., (xm, ym)) is linearly separable; in this setting, the SVM algorithm selects the maximum
margin linear classifier, i.e. the linear classifier that separates the training data with the largest margin. More precisely, define the (geometric) margin of a linear classifier $h_{w,b}(x) = \mathrm{sign}(w^\top x + b)$ on an example $(x_i, y_i) \in \mathbb{R}^d \times \{\pm 1\}$ as
$$\gamma_i = \frac{y_i(w^\top x_i + b)}{\|w\|_2} \,. \qquad (1)$$
Note that the distance of $x_i$ from the hyperplane $w^\top x + b = 0$ is given by $\frac{|w^\top x_i + b|}{\|w\|_2}$; therefore the above margin on $(x_i, y_i)$ is simply a signed version of this distance, with a positive sign if the example is classified correctly and a negative sign otherwise. The (geometric) margin of $h_{w,b}$ on the sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$ is then defined as the minimal margin over examples in $S$:
$$\gamma = \min_{i \in [m]} \gamma_i \,. \qquad (2)$$
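Eqs. (1) and (2) translate directly into code. The sketch below (again on hypothetical toy data, not the sample in Figure 1) also illustrates that the geometric margin is invariant to positive rescaling of $(w, b)$, since the numerator and the norm $\|w\|_2$ scale together:

```python
import numpy as np

def geometric_margin(w, b, X, y):
    """Geometric margin of h_{w,b} on the sample S, per Eqs. (1)-(2)."""
    per_example = y * (X @ w + b) / np.linalg.norm(w)  # Eq. (1), for each i
    return per_example.min()                           # Eq. (2)

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = np.array([1.0, 1.0]), 0.0

g1 = geometric_margin(w, b, X, y)          # 3 / sqrt(2) on this toy data
g2 = geometric_margin(5 * w, 5 * b, X, y)  # same value: rescaling (w, b) changes nothing
```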
Given a linearly separable training sample $S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (\mathbb{R}^d \times \{\pm 1\})^m$, the hard margin SVM algorithm finds a linear classifier that maximizes the above margin on $S$. In particular, any linear classifier that separates $S$ correctly will have margin $\gamma > 0$; without loss of generality, we can represent any such classifier by some $(w, b)$ such that
$$\min_{i \in [m]} y_i(w^\top x_i + b) = 1 \,. \qquad (3)$$
The margin of such a classifier on $S$ then becomes simply
$$\gamma = \min_{i \in [m]} \frac{y_i(w^\top x_i + b)}{\|w\|_2} = \frac{1}{\|w\|_2} \,. \qquad (4)$$
Thus, maximizing the margin becomes equivalent to minimizing the norm $\|w\|_2$ subject to the constraints in Eq. (3), which can be written as the following optimization problem:
$$\min_{w,b} \ \frac{1}{2}\|w\|_2^2 \qquad (5)$$
$$\text{subject to} \quad y_i(w^\top x_i + b) \ge 1 \,, \quad i = 1, \ldots, m \,. \qquad (6)$$
This is a convex quadratic program (QP) and can in principle be solved directly. However it is useful to consider the dual of the above problem, which sheds light on the structure of the solution and also facilitates the extension to nonlinear classifiers which we will see in the next lecture. Note that by our assumption that the data is linearly separable, the above problem satisfies Slater's condition, and so strong duality holds. Therefore solving the dual problem is equivalent to solving the above primal problem. Introducing dual variables (or Lagrange multipliers) $\alpha_i \ge 0$ ($i = 1, \ldots, m$) for the inequality constraints above gives the Lagrangian function
$$L(w, b, \alpha) = \frac{1}{2}\|w\|_2^2 + \sum_{i=1}^m \alpha_i \bigl(1 - y_i(w^\top x_i + b)\bigr) \,. \qquad (7)$$
The (Lagrange) dual function is then given by
$$\phi(\alpha) = \inf_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} L(w, b, \alpha) \,.$$
To compute the dual function, we set the derivatives of $L(w, b, \alpha)$ w.r.t. $w$ and $b$ to zero; this gives the following:
$$w = \sum_{i=1}^m \alpha_i y_i x_i \qquad (8)$$
$$\sum_{i=1}^m \alpha_i y_i = 0 \,. \qquad (9)$$
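The primal QP (5)-(6) is small enough on toy data to hand to a general-purpose constrained solver. The sketch below uses SciPy's SLSQP method on hypothetical toy data (not from the lecture); for real problems one would use a dedicated QP or SVM solver:

```python
import numpy as np
from scipy.optimize import minimize

def hard_margin_svm_primal(X, y):
    """Solve min_{w,b} (1/2)||w||^2  s.t.  y_i (w^T x_i + b) >= 1,
    i.e. the primal QP (5)-(6), with a generic solver. theta = (w, b)."""
    m, d = X.shape
    objective = lambda theta: 0.5 * theta[:d] @ theta[:d]   # (1/2)||w||_2^2
    constraints = {"type": "ineq",
                   "fun": lambda theta: y * (X @ theta[:d] + theta[d]) - 1.0}
    res = minimize(objective, np.zeros(d + 1),
                   constraints=constraints, method="SLSQP")
    return res.x[:d], res.x[d]

# Illustrative toy sample in R^2
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = hard_margin_svm_primal(X, y)
# At the optimum, min_i y_i(w^T x_i + b) = 1 (the constraint is tight for the
# examples closest to the boundary), and the margin is 1 / ||w||_2, as in Eq. (4).
```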
Substituting these back into L(w, b, α), we have the following dual function:
$$\phi(\alpha) = -\frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j (x_i^\top x_j) + \sum_{i=1}^m \alpha_i \,;$$
this dual function is defined over the domain $\{\alpha \in \mathbb{R}^m : \sum_{i=1}^m \alpha_i y_i = 0\}$. This leads to the following dual problem:
$$\max_{\alpha} \ -\frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j (x_i^\top x_j) + \sum_{i=1}^m \alpha_i \qquad (10)$$
$$\text{subject to} \quad \sum_{i=1}^m \alpha_i y_i = 0 \qquad (11)$$
$$\alpha_i \ge 0 \,, \quad i = 1, \ldots, m \,. \qquad (12)$$
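In the same spirit, the dual QP (10)-(12) can be handed to a generic solver. The sketch below (again on hypothetical toy data, not from the lecture) minimizes the negated dual objective and recovers the weight vector through the stationarity condition in Eq. (8):

```python
import numpy as np
from scipy.optimize import minimize

def hard_margin_svm_dual(X, y):
    """Solve the dual QP (10)-(12) with a generic solver."""
    m = X.shape[0]
    Yx = X * y[:, None]
    K = Yx @ Yx.T                                    # K_ij = y_i y_j (x_i^T x_j)
    objective = lambda a: 0.5 * a @ K @ a - a.sum()  # negative of (10), to minimize
    constraints = {"type": "eq", "fun": lambda a: a @ y}  # Eq. (11)
    bounds = [(0.0, None)] * m                            # Eq. (12)
    res = minimize(objective, np.full(m, 1.0 / m),
                   bounds=bounds, constraints=constraints, method="SLSQP")
    alpha = res.x
    w = (alpha * y) @ X                              # Eq. (8)
    # b follows from any example with alpha_i > 0, for which the primal
    # constraint holds with equality: y_i (w^T x_i + b) = 1, so b = y_i - w^T x_i.
    i = int(np.argmax(alpha))
    b = y[i] - X[i] @ w
    return alpha, w, b

# Illustrative toy sample in R^2
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha, w, b = hard_margin_svm_dual(X, y)
# Only the alphas of the examples closest to the decision boundary end up nonzero.
```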
This is again a convex QP (in the $m$ variables $\alpha_i$) and can be solved efficiently using numerical optimization methods. On obtaining the solution $\hat{\alpha}$ to the above dual problem, the weight vector $\hat{w}$ corresponding to the maximum margin classifier can be obtained via Eq. (8):
$$\hat{w} = \sum_{i=1}^m \hat{\alpha}_i y_i x_i \,.$$
Now, by the complementary slackness condition in the KKT conditions, we have for each $i \in [m]$,