
Support Vector Machines

Andrew Nobel

April 2020

Support Vector Machines (SVM)

- Simplest case: linear classification rule

- Generalizes to non-linear rules through feature maps and kernels

- Good off-the-shelf method for high-dimensional data, widely used

- Begins with geometry rather than a statistical model

- Close connections with convex programming

- Early bridge between machine learning and optimization

Linear Classification Rules

Setting: Labeled pairs (x, y) with predictor x ∈ R^p and class y ∈ {+1, −1}

Recall: Given w ∈ R^p and b ∈ R, the linear classification rule has the form

\phi(x) = \mathrm{sign}(x^t w - b) =
\begin{cases}
+1 & \text{if } w^t x \ge b \\
-1 & \text{if } w^t x < b
\end{cases}

The decision boundary of φ is the hyperplane H = {x : w^t x = b}.

Idea: Given a pair (x, y), ask two questions

- Correctness: Is φ(x) = y?

- Confidence: How far is x from the decision boundary H?

Distance to Decision Boundary

Fact: For every x, the ratio (x^t w − b)/||w|| equals the signed distance from x to H

Margin

Definition: The margin of the rule φ at a pair (x, y) is given by

m_\phi(x, y) = y \left( \frac{x^t w - b}{\|w\|} \right)

Idea: Margin measures overall fit of φ at (x, y)

- m_φ(x, y) > 0 iff φ(x) = y iff x is on the correct side of H

- m_φ(x, y) < 0 iff φ(x) ≠ y iff x is on the wrong side of H

- |m_φ(x, y)| = distance from x to H (from the Fact above)
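As a small illustration of these definitions, the sketch below evaluates φ and the margins m_φ(x_i, y_i) with numpy. The rule (w, b) and the data (X, y) are hypothetical, chosen only for illustration.

```python
import numpy as np

# hypothetical rule: w and b define the hyperplane H = {x : w^t x = b}
w = np.array([1.0, 2.0])
b = 1.0

# hypothetical labeled data: rows of X are predictors, y the classes
X = np.array([[2.0, 1.0], [0.0, 0.0], [-1.0, 2.0]])
y = np.array([+1, -1, +1])

h = X @ w - b                         # h(x_i) = x_i^t w - b
phi = np.sign(h)                      # predicted classes, here [ 1. -1.  1.]
margins = y * h / np.linalg.norm(w)   # m_phi(x_i, y_i): distance to H, signed by correctness

print(phi, margins)                   # margins are positive iff phi(x_i) = y_i
```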

Max Margin Classifiers and Linear Separability

Goal: In fitting a linear rule to data, we would like the margins of all the data points to be large and positive, if possible.

Two cases

1. Data is linearly separable ⇒ max margin classifier

2. Data is not linearly separable ⇒ soft margin classifier

Definition: D_n = (x_1, y_1), ..., (x_n, y_n) is linearly separable if there is a hyperplane H separating {x_i : y_i = 1} and {x_i : y_i = −1}

Linearly Separable Data: Multiple Hyperplanes

Max Margin Classifier (from ISL)

[Figures: linearly separable data with several candidate separating hyperplanes, and the resulting max margin classifier; axes X1, X2]

Maximizing the Minimum Margin

Max Margin Classifier: If D_n is linearly separable, find w and b to maximize the minimum margin of φ(x) = sign(x^t w − b):

\max_{w,b} \Gamma(w, b) \quad \text{where} \quad \Gamma(w, b) = \min_{1 \le i \le n} y_i \left( \frac{x_i^t w - b}{\|w\|} \right)

Fact: The solution of \max_{w,b} \Gamma(w, b) is the solution of the convex program

p^* = \min_{w,b} \; \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y_i(x_i^t w - b) \ge 1 \ \text{for } i = 1, \ldots, n

Finding p^* is called the primal problem.
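As a sanity check on this formulation, the primal program can be handed directly to a convex solver. Below is a minimal sketch using cvxpy on a hypothetical linearly separable toy data set (X and y are made up for illustration; they are not from the slides).

```python
import cvxpy as cp
import numpy as np

# hypothetical linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# primal: minimize (1/2)||w||^2  s.t.  y_i (x_i^t w - b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w - b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)  # defines the max margin hyperplane {x : w^t x = b}
```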

Solving the Problem of Maximizing the Minimum Margin

Approach: Solve the primal problem using techniques of convex optimization: the Lagrangian function and duality

Step 1: Define the Lagrangian L : R^p × R × R_+^n → R, with R_+ = [0, ∞), by

L(w, b, \lambda) \;:=\; \frac{1}{2}\|w\|^2 \;-\; \sum_{i=1}^n \lambda_i \,\{ y_i (w^t x_i - b) - 1 \}

Note: The Lagrangian combines the objective and constraints into a single function. The new variables λ_i are called Lagrange multipliers.

The Lagrangian and Duality

Key idea: The Lagrangian turns the primal problem into a min-max problem. Note that

\max_{\lambda \ge 0} L(w, b, \lambda) =
\begin{cases}
\frac{1}{2}\|w\|^2 & \text{if the constraints are satisfied} \\
+\infty & \text{otherwise}
\end{cases}

Thus the primal problem can be written as

p^* = \min_{w,b} \, \max_{\lambda \ge 0} \, L(w, b, \lambda)

Dual: Changing the order of min and max yields the dual problem

d^* = \max_{\lambda \ge 0} \, \min_{w,b} \, L(w, b, \lambda)

Lagrangian and Duality

Definition: For each λ ≥ 0 define the Lagrange dual function

\tilde{L}(\lambda) = \min_{w,b} L(w, b, \lambda)

Fact

- The dual function L̃(λ) is concave and has a global maximum

- Thus the dual problem d^* = max_{λ ≥ 0} L̃(λ) has a solution

- In general, d^* ≤ p^*. The difference p^* − d^* ≥ 0 is called the duality gap

- For our problem, d^* = p^*, that is, the duality gap is zero.

Lagrangian Dual Function and Dual Problem

Step 2: Fix λ ≥ 0 and minimize L(w, b, λ) over w, b. Differentiation gives

w = \sum_{i=1}^n \lambda_i y_i x_i \quad \text{and} \quad \sum_{i=1}^n \lambda_i y_i = 0

Substituting these equations into L(w, b, λ) yields the quadratic dual function

\tilde{L}(\lambda) = \sum_{i=1}^n \lambda_i \;-\; \frac{1}{2} \sum_{i,j=1}^n \lambda_i \lambda_j \, y_i y_j \, \langle x_i, x_j \rangle

Step 3: Solve the convex dual problem using quadratic programming

\max_{\lambda} \tilde{L}(\lambda) \quad \text{s.t.} \quad \sum_{i=1}^n \lambda_i y_i = 0 \ \text{and} \ \lambda_1, \ldots, \lambda_n \ge 0
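The dual can also be solved numerically. A minimal cvxpy sketch on hypothetical toy data: it uses the identity Σ_{i,j} λ_i λ_j y_i y_j ⟨x_i, x_j⟩ = ||Σ_i λ_i y_i x_i||², which keeps the objective in a form the solver accepts, and then recovers w and b as in Step 4 below.

```python
import cvxpy as cp
import numpy as np

# hypothetical linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

G = y[:, None] * X                # row i is y_i x_i
lam = cp.Variable(n)

# dual: maximize sum_i lam_i - (1/2) || sum_i lam_i y_i x_i ||^2
objective = cp.Maximize(cp.sum(lam) - 0.5 * cp.sum_squares(G.T @ lam))
constraints = [lam >= 0, y @ lam == 0]
cp.Problem(objective, constraints).solve()

# recover w and b from the optimal multipliers (Step 4)
lam_opt = lam.value
w = G.T @ lam_opt                 # w = sum_i lam_i y_i x_i
b = 0.5 * ((X[y == 1] @ w).min() + (X[y == -1] @ w).max())
print(w, b)
```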

Solving the Problem of Maximizing the Minimum Margin

Step 4: Combine the solution λ of the dual problem with the optimality conditions to get the desired values of w and b:

w = \sum_{i=1}^n \lambda_i y_i x_i \qquad b = \frac{1}{2} \left( \min_{i : y_i = 1} x_i^t w \;+\; \max_{i : y_i = -1} x_i^t w \right)

Upshot: The maximum margin classification rule is φ(x) = sign(h(x)) where

h(x) = x^t w - b = \sum_{i=1}^n \lambda_i y_i \langle x_i, x \rangle - b

Note: The feature vectors x_i affect the rule φ only through inner products

- The dual L̃(λ) depends on the x_i's only through the inner products ⟨x_i, x_j⟩

- The function h(x) depends on the x_i's only through the inner products ⟨x_i, x⟩

Support Vectors

KKT Conditions: For i = 1, . . . , n, the optimal w, b, and λ are such that

\lambda_i \, [\, y_i h(x_i) - 1 \,] = 0 \;\Rightarrow\; \lambda_i = 0 \ \text{or} \ y_i h(x_i) = 1

Let S = {i : λ_i > 0}. Note that

- h(x) = Σ_{i∈S} λ_i y_i ⟨x_i, x⟩ − b

- i ∈ S ⇒ y_i h(x_i) = 1 ⇒ x_i lies on the max margin hyperplane for class y_i

Definition: The training vectors x_i with i ∈ S are called support vectors

- Changing a support vector, with the other data fixed, would change the decision boundary
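In practice one rarely solves the dual by hand; a library such as scikit-learn exposes the multipliers and the support set directly. A minimal sketch, assuming a linear kernel and a hypothetical toy data set (a large C approximates the hard-margin classifier):

```python
import numpy as np
from sklearn.svm import SVC

# hypothetical linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 1.0], [2.5, 3.0],
              [-2.0, -1.0], [-1.0, -2.0], [-3.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C ~ hard margin

print(clf.support_)          # indices of the support set S
print(clf.support_vectors_)  # the support vectors x_i, i in S
print(clf.dual_coef_)        # lambda_i * y_i for i in S
print(clf.coef_, clf.intercept_)  # w and intercept (sklearn uses h(x) = w^t x + intercept)
```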

Extending SVM to Non-Separable Case

Most data sets are not linearly separable: no hyperplane can separate the +1's from the −1's

[Figure: data that are not linearly separable; axes X1, X2]

Question: How can maximum margin classifiers be extended to this setting?

SVM: Non-Separable Case

Idea: Reformulate the primal problem. For fixed C > 0, solve the convex program

\min_{w, b, \xi} \left\{ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^n \xi_i \right\}

\text{s.t.} \quad y_i(x_i^t w - b) \ge 1 - \xi_i \ \text{and} \ \xi_i \ge 0

Here ξ_1, . . . , ξ_n are slack variables

- ξ_i captures the slack in the hard constraint y_i(x_i^t w − b) ≥ 1

- ξ_i is the distance of the prediction x_i^t w − b from its target halfspace, relative to the margin size

- Cξ_i measures the associated cost
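The soft margin primal is again a convex program that a generic solver can handle. A minimal cvxpy sketch, assuming a hypothetical non-separable toy data set and an arbitrary choice C = 1:

```python
import cvxpy as cp
import numpy as np

# hypothetical non-separable toy data
X = np.array([[2.0, 2.0], [0.5, -0.5], [-0.5, 0.5], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, p = X.shape
C = 1.0   # tradeoff parameter, chosen arbitrarily here

w = cp.Variable(p)
b = cp.Variable()
xi = cp.Variable(n)   # slack variables

objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w - b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
print(xi.value)   # positive slack marks points that violate the margin
```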

SVM: Non-Separable Case, cont.

Minimizing the objective function (1/2)||w||² + C Σ_{i=1}^n ξ_i entails a tradeoff between

- making ||w||² small, enlarging the margin around the separating hyperplane

- making Σ_{i=1}^n ξ_i small, minimizing margin violations relative to the margin size

- the parameter C controls the tradeoff

Slack Variables and Margins

Consider the linear function h(x) = x^t w − b and the associated rule φ(x) = sign(h(x))

- Separating hyperplane H = {x : h(x) = 0}

- Target half spaces H^+ = {x : h(x) ≥ 1} and H^− = {x : h(x) ≤ −1}

Consider a data point (x_i, y_i) with m_i = y_i h(x_i). Three cases

1. If m_i ≥ 1 then φ(x_i) = y_i and x_i ∈ H^{y_i}, slack ξ_i = 0

2. If 0 ≤ m_i < 1 then φ(x_i) = y_i but x_i ∉ H^{y_i}, slack ξ_i = 1 − m_i ∈ (0, 1]

3. If m_i < 0 then φ(x_i) ≠ y_i and x_i ∉ H^{y_i}, slack ξ_i = 1 − m_i > 1
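These three cases collapse into the single formula ξ_i = max(0, 1 − m_i), which the short numpy sketch below evaluates for a hypothetical fitted rule (w, b, X, y made up for illustration).

```python
import numpy as np

# hypothetical soft margin solution and data
w, b = np.array([1.0, 1.0]), 0.0
X = np.array([[2.0, 1.0], [0.3, 0.2], [-0.5, 0.2], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])

m = y * (X @ w - b)            # margins m_i = y_i h(x_i)
xi = np.maximum(0.0, 1.0 - m)  # slack: 0 if m_i >= 1, in (0,1] if 0 <= m_i < 1, > 1 if m_i < 0
print(np.c_[m, xi])
```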

Finding Soft Margin Classifier

Solve the primal problem using the same Lagrangian-dual method as in the separable case

The Lagrangian of the primal problem is

L(w, b, \xi, \lambda, \gamma) := \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i \;-\; \sum_{i=1}^n \lambda_i \{ y_i (w^t x_i - b) - 1 + \xi_i \} \;-\; \sum_{i=1}^n \gamma_i \xi_i

The dual problem is given by

\max_{\lambda} \left\{ \sum_{i=1}^n \lambda_i - \frac{1}{2} \sum_{i,j=1}^n \lambda_i \lambda_j \, y_i y_j \, \langle x_i, x_j \rangle \right\}

\text{subject to} \quad 0 \le \lambda_i \le C \ \text{and} \ \sum_{i=1}^n \lambda_i y_i = 0

Soft Margin Classifier

Finally: Combine the solution λ of the dual problem and the KKT optimality conditions to obtain the support set S = {i : λ_i > 0} and the optimal w, b

w = \sum_{i \in S} \lambda_i y_i x_i \qquad b = \text{function of } \lambda \text{ and the data}

Upshot: The optimal soft margin classification rule is φ(x) = sign(h(x)) where

h(x) = x^t w - b = \sum_{i \in S} \lambda_i y_i \langle x_i, x \rangle - b

Again: The rule φ depends on the feature vectors x_i, x only through inner products

- L̃(λ) depends on the x_i's only through ⟨x_i, x_j⟩

- h(x) depends on the x_i's only through ⟨x_i, x⟩

Effect of Parameter C

[Figure: SVM decision boundaries with small C (top left) to large C (bottom right); data non-separable; axes X1, X2. (From ISL)]
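To reproduce this kind of comparison, one can fit the soft margin classifier for several values of C and inspect how the fit changes. A minimal sketch with scikit-learn on hypothetical data; recall that, in this parameterization, larger C penalizes margin violations more heavily.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# hypothetical non-separable data: two overlapping Gaussian clusters
X = np.vstack([rng.normal(loc=-1.0, size=(50, 2)), rng.normal(loc=1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in [0.01, 0.1, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # larger C -> fewer margin violations tolerated -> typically fewer support vectors
    print(C, "support vectors:", len(clf.support_), "train accuracy:", clf.score(X, y))
```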

Nonlinear SVM: Background

Note: The inner product ⟨x, x′⟩ is a signed measure of similarity between x and x′

- ⟨x, x′⟩ = ||x|| ||x′|| if x, x′ point in the same direction

- ⟨x, x′⟩ = 0 if x, x′ are orthogonal

- ⟨x, x′⟩ = −||x|| ||x′|| if x, x′ point in opposite directions

Goal: Enhance and expand the applicability of the standard SVM

- Map predictors x to a new feature space via a nonlinear transformation

- Classify data using similarity between the transformed features

- In many cases the new feature space is high dimensional

Direct Approach to Nonlinear SVM: Feature Maps

Given: Data (x_1, y_1), ..., (x_n, y_n) ∈ X × {±1}

- Define a feature map γ : X → R^d taking predictors to high-dimensional features

- Apply the SVM to the observations (γ(x_1), y_1), ..., (γ(x_n), y_n)

- The SVM classifier is the sign of h(x) = Σ_{i=1}^n λ_i y_i ⟨γ(x_i), γ(x)⟩ − b

Example 1: Two-way interactions (polynomials of degree two)

- Predictor space X = R^p

- Define the feature map γ : X → R^d by γ(x) = (x_i x_j)_{1 ≤ i, j ≤ p}

- Computing ⟨γ(x), γ(x′)⟩ requires d = p² operations (see the sketch below)
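A short sketch of Example 1, showing that the p²-dimensional feature map can be written down explicitly and that its inner product reduces to ⟨u, v⟩² (this identity is a standard fact, not stated on the slide).

```python
import numpy as np

def gamma(x):
    """Degree-two interaction feature map: gamma(x) = (x_i * x_j) for 1 <= i, j <= p."""
    return np.outer(x, x).ravel()        # length p^2

u = np.array([1.0, 2.0, -1.0])
v = np.array([0.5, 1.0, 3.0])

lhs = gamma(u) @ gamma(v)                # <gamma(u), gamma(v)>, costs O(p^2)
rhs = (u @ v) ** 2                       # <u, v>^2, costs O(p) -- the kernel shortcut
print(lhs, rhs)                          # both equal (u.v)^2 = 0.25 here
```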

Feature Maps, cont.

Example 2: Bag-of-words representation of documents

- Predictor space X = {English language documents}

- Fix a set of words (vocabulary) V of interest

- Define a map γ : X → {0, 1, 2, ...}^V from documents to word counts by

γ(x) = # occurrences of each word v ∈ V in document x

- Computing ⟨γ(x), γ(x′)⟩ requires d = |V| operations

Note: The bag-of-words representation is common in natural language processing
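A minimal sketch of the bag-of-words map for a tiny, made-up vocabulary and two made-up documents; the inner product of the count vectors is the similarity fed to the SVM.

```python
import numpy as np

# hypothetical fixed vocabulary V and documents
V = ["svm", "margin", "kernel", "data"]
doc1 = "the svm maximizes the margin on the data"
doc2 = "a kernel svm maps data to a feature space"

def gamma(doc, vocab=V):
    """Bag-of-words feature map: counts of each vocabulary word in the document."""
    words = doc.lower().split()
    return np.array([words.count(v) for v in vocab])

g1, g2 = gamma(doc1), gamma(doc2)
print(g1, g2)          # word counts over V
print(g1 @ g2)         # <gamma(x), gamma(x')>: |V| operations
```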

Nonlinear SVM via Kernels

Basic idea: Replace the inner product ⟨·, ·⟩ by a kernel function K : X × X → R, where K(u, v) measures the similarity between u and v. Key assumptions:

- K(u, v) = K(v, u)

- For all u_1, . . . , u_n ∈ X the matrix {K(u_i, u_j) : 1 ≤ i, j ≤ n} ≥ 0 (positive semidefinite)

Kernel classifier: SVM with kernel K

- Solve the Lagrange dual problem, replacing ⟨x_i, x_j⟩ by K(x_i, x_j)

- The optimal rule is φ(x) = sign(h(x)), where

h(x) = \sum_{i \in S} \lambda_i y_i K(x_i, x) - b
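In scikit-learn this amounts to choosing the kernel when fitting; the fitted object stores the support set and multipliers needed to evaluate h(x). A minimal sketch on hypothetical data, using the radial basis (RBF) kernel:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# hypothetical data with a nonlinear class boundary (label = sign of ||x|| - 1)
X = rng.normal(size=(200, 2))
y = np.where(np.linalg.norm(X, axis=1) > 1.0, 1, -1)

clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)   # K(u, v) = exp(-gamma * ||u - v||^2)

# decision_function(x) plays the role of h(x); the predicted class is its sign
x_new = np.array([[0.2, 0.1], [2.0, 2.0]])
print(clf.decision_function(x_new))
print(clf.predict(x_new))
```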

Examples of Kernels

1. Feature map. Given γ : X → R^d, define the kernel K(u, v) = ⟨γ(u), γ(v)⟩

2. Polynomial. For X = R^d let K(u, v) = (1 + ⟨u, v⟩)^d

3. Radial basis. For X = R^d let K(u, v) = exp{−c ||u − v||²}

4. Neural network. For X = R^d let K(u, v) = tanh(a⟨u, v⟩ + b)

Fact: Under appropriate conditions, the kernel K(u, v) = ⟨γ(u), γ(v)⟩ for a suitable feature map γ

- The feature space may be infinite dimensional

- Computing K(u, v) may be faster than computing ⟨γ(u), γ(v)⟩
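These kernels are easy to write down directly; the sketch below implements the polynomial and radial basis examples and numerically checks the two key assumptions (symmetry and a positive semidefinite kernel matrix) on random points. The constants c and the degree are arbitrary choices for illustration.

```python
import numpy as np

def poly_kernel(u, v, degree=3):
    return (1.0 + u @ v) ** degree

def rbf_kernel(u, v, c=0.5):
    return np.exp(-c * np.sum((u - v) ** 2))

rng = np.random.default_rng(2)
U = rng.normal(size=(20, 4))               # random points u_1, ..., u_n

for kernel in (poly_kernel, rbf_kernel):
    K = np.array([[kernel(ui, uj) for uj in U] for ui in U])
    symmetric = np.allclose(K, K.T)
    psd = np.min(np.linalg.eigvalsh(K)) >= -1e-8   # eigenvalues nonnegative up to roundoff
    print(kernel.__name__, "symmetric:", symmetric, "PSD:", psd)
```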

Revisiting the Soft Margin Classifier

Recall: The soft margin classifier has the primal problem

\min_{w,b,\xi} \left\{ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i \right\} \quad \text{s.t.} \quad y_i(x_i^t w - b) \ge 1 - \xi_i \ \text{and} \ \xi_i \ge 0

Equivalent Problem: The primal problem can be written in the form

\min_{w,b} \left\{ \sum_{i=1}^n \ell_h(w^t x_i - b, \, y_i) + \lambda \|w\|^2 \right\}

- ℓ_h(s, t) = [1 − st]_+ = max(1 − st, 0) is the "hinge loss"

- ℓ_h(s, t) is convex in s for fixed t, so ℓ_h(w^t x − b, y) is convex in w, b

- The equivalent problem is a convex program
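This penalized form can be handed to a convex solver exactly as written. A minimal cvxpy sketch on hypothetical data, with an arbitrary choice of λ:

```python
import cvxpy as cp
import numpy as np

# hypothetical non-separable toy data
X = np.array([[2.0, 2.0], [0.5, -0.5], [-0.5, 0.5], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
lam = 0.1   # regularization parameter, chosen arbitrarily

w = cp.Variable(X.shape[1])
b = cp.Variable()

# hinge loss [1 - y_i (w^t x_i - b)]_+ via cp.pos(z) = max(z, 0)
hinge = cp.sum(cp.pos(1 - cp.multiply(y, X @ w - b)))
cp.Problem(cp.Minimize(hinge + lam * cp.sum_squares(w))).solve()

print(w.value, b.value)
```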

Revisiting Soft Margin, cont.

Note the similarity between the hinge-loss problem and ridge regression

\min_{\beta} \left\{ \sum_{i=1}^n \ell(\beta^t x_i, y_i) + \lambda \|\beta\|^2 \right\} \quad \text{with} \ \ell(s, t) = (s - t)^2

Sparse SVM: The connection with ridge regression suggests an SVM with an ℓ_1-penalty

\min_{w,b} \left\{ \sum_{i=1}^n \ell_h(w^t x_i - b, \, y_i) + \lambda \|w\|_1 \right\}

- The ℓ_1-penalty sets many coefficients of w to zero

- Interpretation: selecting important features

- A similar idea can be applied to logistic regression
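A minimal sketch of the ℓ_1-penalized problem, again with cvxpy on hypothetical data; with a large enough λ, several coordinates of w are driven to (numerically) zero. In scikit-learn, LinearSVC with penalty='l1' plays a similar role, though it uses the squared hinge loss.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(3)

# hypothetical data: only the first 2 of 10 features carry signal
X = rng.normal(size=(100, 10))
y = np.sign(X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=100))

w = cp.Variable(10)
b = cp.Variable()
lam = 2.0   # larger lambda -> sparser w

hinge = cp.sum(cp.pos(1 - cp.multiply(y, X @ w - b)))
cp.Problem(cp.Minimize(hinge + lam * cp.norm1(w))).solve()

print(np.round(w.value, 3))   # most coefficients should be (near) zero
```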