Support Vector Machines
Andrew Nobel
April, 2020 Support Vector Machines (SVM)
I Simplest case: linear classification rule
I Generalizes to non-linear rules through feature maps and kernels
I Good off-the-shelf method for high dimensional data, widely used
I Begins with geometry rather than a statistical model
I Close connections with convex programming
I Early bridge between machine learning and optimization Linear Classification Rules
p Setting: Labeled pairs (x, y) with predictor x ∈ R and class y ∈ {+1, −1}
p Recall: Given w ∈ R and b ∈ R linear classification rule has form
( +1 if wtx ≥ b φ(x) = sign(xtw − b) = −1 if wtx < b
Decision boundary of φ equal to hyperplane H = {x : wtx = b}.
Idea: Given pair (x, y) ask two questions
I Correctness: Is φ(x) = y?
I Confidence: How far is x from decision boundary H? Distance to Decision Boundary
Fact: For every x the ratio (xtw − b)/||w|| = signed distance from x to H Margin
Definition: The margin of the rule φ at a pair (x, y) is given by
xtw − b m (x, y) = y φ ||w||
Idea: Margin measures overall fit of φ at (x, y)
I mφ(x, y) > 0 iff φ(x) = y iff x on correct side of H
I mφ(x, y) < 0 iff φ(x) 6= y iff x on wrong side of H
I |mφ(x, y)| = distance from x to H (from Fact above) Max Margin Classifiers and Linearly Separability
*Goal*: In fitting a linear rule to data, we would like the margins of all the data points to be large and positive, if possible.
Two cases
1. Data is linearly separable ⇒ max margin classifier
2. Data is not linearly separable ⇒ soft margin classifier
Definition: Dn = (x1, y1),..., (xn, yn) is linearly separable if there is a hyperplane H separating {xi : yi = 1} and {xi : yi = −1} Linearly Separable Data: Multiple Hyperplanes Max Margin Classifier (from ISL) 2 X −1 0 1 2 3
−1 0 1 2 3
X1 Maximizing the Minimum Margin
Max Margin Classifier: If Dn is linearly separable, find w and b to maximize the minimum margin of φ(x) = sign(xtw − b)
t xiw − b max Γ(w, b) where Γ(w, b) = min yi w,b 1≤i≤n ||w||
Fact: Solution of maxw,b Γ(w, b) is the solution of the convex program
∗ 1 2 t p = min ||w|| subject to yi(xiw − b) ≥ 1 for i = 1, . . . , n w,b 2
Finding p∗ is called the primal problem Solving the Problem of Maximizing the Minimum Margin
Approach: Solve primal problem using techniques of convex optimization: the Lagrangian function and duality
p n Step 1: Define the Lagrangian L : R × R × R+, with R+ = [0, ∞), by
n 1 2 X t L(w, b, λ) := ||w|| − λ {y (w x − b) − 1} 2 i i i i=1
Note: Lagrangian combines objective and constraints into a single function. New variables λi called Lagrange multipliers. The Lagrangian and Duality
Key idea: Lagrangian turns primal problem into min-max problem. Note that
( ||w||2 if constraints satisfied max L(w, b, λ) = λ≥0 +∞ otherwise
Thus the primal problem can be written as
p∗ = min max L(w, b, λ) w,b λ≥0
Dual: Changing order of min and max yields the dual problem
d∗ = max min L(w, b, λ) λ≥0 w,b Lagrangian and Duality
Definition: For each λ ≥ 0 define the Lagrange dual function
L˜(λ) = min L(w, b, λ) w,b
Fact
I The dual function L˜(λ) is concave and has a global maximum
∗ I Thus the dual problem d = maxλ≥0 L˜(λ) has a solution
∗ ∗ ∗ ∗ I In general, d ≤ p . Difference p − d ≥ 0 called duality gap
∗ ∗ I For our problem, d = p , that is, duality gap is zero. Lagrangian Dual Function and Dual Problem
Step 2: Fix λ ≥ 0 and minimize L(w, b, λ) over w, b. Differentiation gives
n n X X w = λi yi xi and λi yi = 0 i=1 i=1
Substituting these equations into L(w, b, λ) yields quadratic dual function
n n X 1 X L˜(λ) = λ − λ λ y y hx , x i i 2 i j i j i j i=1 i,j=1
Step 3: Solve convex dual problem using quadratic programming
n X max L˜(λ) s.t. λi yi = 0 and λ1, . . . , λn ≥ 0 i=1 Solving the Problem of Maximizing the Minimum Margin
Step 4: Combine solution λ of dual problem and optimality conditions to get desired values of w and b n X 1 t t w = λi yi xi b = min xiw + max xiw 2 i:yi=1 i:yi=−1 i=1
Upshot: Maximum margin classification rule φ(x) = sign(h(x)) where
n t X h(x) = x w − b = λi yi hxi, xi − b i=1
Note: Feature vectors xi affect rule φ only through inner products
I Dual L˜(λ) depends on xi’s only through inner products hxi, xj i
I Function h(x) depends on xi’s only through inner products hxi, xi Support Vectors
KKT Conditions: For i = 1, . . . , n optimal w, b, and λ are such that
?λ i [yih(xi) − 1] = 0 ⇒ λi = 0 or yih(xi) = 1
Let S = {i : λi > 0}. Note that
P I h(x) = i∈S λi yi hxi, xi − b
I i ∈ S ⇒ yih(xi) = 1 ⇒ xi on max margin hyperplane for class yi
Definition: Training vectors xi with i ∈ S called support vectors
I Changing a support vector with other data fixed would change the decision boundary Extending SVM to Non-Separable Case
Most data sets not linearly separable: no hyperplane can separate ±1’s 2 X −1.0 −0.5 0.0 0.5 1.0 1.5 2.0
0 1 2 3
X1
Question: How to extend maximum margin classifiers to this setting? SVM: Non-Separable Case
Idea: Reformulate primal problem. For fixed C > 0 solve convex program
( n ) 1 2 X min ||w|| + C ξi w,b,ξ 2 i=1
t s.t. yi(xiw − b) ≥ 1 − ξi and ξi ≥ 0
Here ξ1, . . . , ξn are slack variables
t I ξi captures slack in hard constraint yi(xiw − b) ≥ 1
t I ξi is distance of prediction xiw − b from optimal halfspace relative to margin size
I Cξi measures associated cost SVM: Non-Separable Case, cont.
1 2 Pn Minimizing objective function 2 ||w|| + C i=1 ξi entails tradeoff between
2 I making ||w|| small, enlarging margin around separating hyperplane
Pn I making i=1 ξi small, minimizing margin violations relative to margin size
I parameter C controls the tradeoff Slack Variables and Margins
Consider linear function h(x) = xtw − b, associated rule φ(x) = sign(h(x))
I Separating hyperplane H = {x : h(x) = 0}
+ − I Target half spaces H = {x : h(x) ≥ 1} and H = {x : h(x) ≤ −1}
Consider data point (xi, yi) with mi = yih(xi). Three cases
yi 1. If mi ≥ 1 then φ(xi) = yi and xi ∈ H , slack ξi = 0
yi 2. If 0 ≤ mi < 1 then φ(xi) = yi but xi 6∈ H , slack ξi = 1 − mi ∈ (0, 1]
yi 3. If mi < 0 then φ(xi) 6= yi and xi 6∈ H , slack ξi = 1 − mi > 1 Finding Soft Margin Classifier
Solve primal problem using same Lagrangian-dual method as separable case
Lagrangian of primal problem is
n n n 1 2 X X t X L(w, b, ξ, λ, γ) := ||w|| + C ξ − λ {y (w x −b)−1+ξ } − γ ξ 2 i i i i i i i i=1 i=1 i=1
Dual problem is given by
( n n ) X 1 X max λ − λ λ y y hx , x i i 2 i j i j i j i=1 i,j=1
n X subject to 0 ≤ λi ≤ C and λiyi = 0 i=1 Soft Margin Classifier
Finally: Combine solution λ of dual problem and KKT optimality conditions to obtain support set S = {i : λi > 0} and optimal w, b
X w = λi yi xi b = function of λ and data i∈S
Upshot: Optimal soft margin classification rule φ(x) = sign(h(x)) where
t X h(x) = x w − b = λi yi hxi, xi − b i∈S
Again: Rule φ depends on feature vectors xi, x only through inner products
I L˜(λ) depends on xi’s only through hxi, xj i
I h(x) depends on xi’s only through hxi, xi Effect of Parameter C 2 2 X X −3 −2 −1 0 1 2 3 −3 −2 −1 0 1 2 3
−1 0 1 2 −1 0 1 2
X1 X1 2 2 X X −3 −2 −1 0 1 2 3 −3 −2 −1 0 1 2 3
−1 0 1 2 −1 0 1 2
X1 X1 Figure: SVM with small C (the top left) to large C (bottom right). Data non-separable. (From ISL) Nonlinear SVM: Background
Note: Inner product hx, x0i is signed measure of similarity between x and x0
0 0 0 I hx, x i = ||x|| ||x || if x, x point in same direction
0 0 I hx, x i = 0 if x, x are orthogonal
0 0 0 I hx, x i = −||x|| ||x || if x, x point in opposite directions
Goal: Enhance and expand applicability of standard SVM
I Map predictors x to new feature space via nonlinear transformation
I Classify data using similarity between transformed features
I In many cases new features space is high dimensional Direct Approach to Nonlinear SVM: Feature Maps
Given: Data (x1, y1),..., (xn, yn) ∈ X × {±1}
d I Define feature map γ : X → R taking predictors to HD features
I Apply SVM to observations (γ(x1), y1),..., (γ(xn), yn)
Pn I SVM classifier is sign of h(x) = i=1 λi yi hγ(xi), γ(x)i − b
Example 1: Two-way interactions (polynomials of degree two)
p I Predictor space X = R
d I Define feature map γ : X → R by γ(x) = (xi xj )1≤i,j≤p
0 2 I Computing hγ(x), γ(x )i requires d = p operations. Feature Maps, cont.
Example 2: Bag-of-words representation of documents
I Predictor space X = {English language documents}
I Fix set of words (vocabulary) V of interest
V I Define map γ : X → {0, 1, 2,...} from docs to word counts by
γ(x) = # occurrences of each word v ∈ V in document x
0 I Computing hγ(x), γ(x )i requires d = |V | operations
Note: Bag-of-words representation common in natural language processing Nonlinear SVM via Kernels
Basic idea: Replace inner product h·, ·i by kernel function K : X × X → R where K(u, v) measures the similarity between u and v. Key assumptions
I K(u, v) = K(v, u)
I For all u1, . . . , un ∈ X the matrix {K(ui, uj ) : 1 ≤ i, j ≤ n} ≥ 0
Kernel classifier: SVM with kernel K
I Solve Lagrange dual problem, replacing hxi, xj i by K(xi, xj )
I Optimal rule rule φ(x) = sign(h(x)) where
X h(x) = λi yi K(xi, x) − b i∈S Examples of Kernels
d 1. Feature map. Given γ : X → R define kernel K(u, v) = hγ(u), γ(v)i
d d 2. Polynomial. For X = R let K(u, v) = (1 + hu, vi)
d 2 3. Radial basis. For X = R let K(u, v) = exp{−c||u − v|| }
d 4. Neural network. For X = R let K(u, v) = tanh(ahu, vi + b)
Fact: Under appropriate conditions kernel K(u, v) = hγ(u), γ(v)i for a suitable feature map γ
I Feature space may be infinite dimensional
I Computing K(u, v) may be faster than computing hγ(u), γ(v)i Revisiting the Soft Margin Classifier
Recall: Soft margin classifier has primal problem
( n ) 1 2 X t min ||w|| + C ξi s.t. yi(xiw − b) ≥ 1 − ξi and ξi ≥ 0 w,b,ξ 2 i=1
Equivalent Problem: Primal problem can be written in form
( n ) X t 2 min `h(w xi − b, yi) + λ||w|| w,b i=1
I `h(s, t) = [1 − st]+ = max(1 − st, 0) “hinge loss ”
t I `h(s, t) convex in s when t fixed, so `h(w x − b, y) convex in w, b
I Equivalent problem is a convex program Revisiting Soft Margin, cont.
Note similarity between hinge-loss problem and ridge regression
( n ) X t 2 2 min `(β xi, yi) + λ||β|| with `(s, t) = (s − t) β i=1
Sparse SVM: Connection with Ridge suggests SVM with `1-penalty
( n ) X t min `h(w xi − b, yi) + λ||w||1 w,b i=1
I The `1-penalty sets many coefficients of w to zero
I Interpretation: selecting important features
I Similar idea can be applied to logistic regression