Lecture 2: SVM classifier

C19 Hilary 2015 A. Zisserman

• Review of linear classifiers

• Support Vector Machine (SVM) classifier
  • Wide margin
  • Cost function
  • Slack variables
  • Loss functions revisited
  • Optimization

Binary Classification

Given training data (x_i, y_i) for i = 1 ... N, with x_i ∈ R^d and y_i ∈ {−1, 1}, learn a classifier f(x) such that

    f(x_i) ≥ 0  if y_i = +1
    f(x_i) < 0  if y_i = −1

i.e. y_i f(x_i) > 0 for a correct classification.

Linear separability

[Figure: examples of linearly separable and not linearly separable data]

Linear classifiers

A linear classifier has the form

    f(x) = w^T x + b

[Figure: 2D feature space (x1, x2) with the decision boundary f(x) = 0 separating the regions f(x) < 0 and f(x) > 0]

• in 2D the discriminant is a line
• w is the normal to the line, and b the bias
• w is known as the weight vector

Linear classifiers

A linear classifier has the form

    f(x) = w^T x + b

• in 3D the discriminant is a plane, and in nD it is a hyperplane

For a K-NN classifier it was necessary to `carry' the training data
For a linear classifier, the training data is used to learn w and then discarded
Only w is needed for classifying new data
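As an aside (not from the lecture notes), a minimal sketch of how such a classifier is applied once w and b are known; numpy and the toy values of w, b and X_new are assumptions for illustration:

```python
# A minimal sketch (not from the lecture) of evaluating a linear classifier
# on new points, assuming w and b have already been learned.
import numpy as np

def predict(w, b, X):
    """Return +1/-1 labels for the rows of X using f(x) = w^T x + b."""
    scores = X @ w + b          # f(x) for every point
    return np.where(scores >= 0, 1, -1)

# toy example: w, b chosen by hand for illustration
w = np.array([1.0, -2.0])
b = 0.5
X_new = np.array([[3.0, 1.0], [0.0, 2.0]])
print(predict(w, b, X_new))     # [ 1 -1]
```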

The Perceptron Classifier

Given linearly separable data x_i labelled into two categories y_i ∈ {−1, 1}, find a weight vector w such that the discriminant function

    f(x_i) = w^T x_i + b

separates the categories for i = 1, ..., N

• how can we find this separating hyperplane?

The Perceptron Algorithm

Write the classifier as

    f(x_i) = w̃^T x̃_i + w_0 = w^T x_i

where w = (w̃, w_0), x_i = (x̃_i, 1)

• Initialize w = 0
• Cycle through the data points { x_i, y_i }
  • if x_i is misclassified then w ← w + α y_i x_i
• Until all the data is correctly classified
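A minimal sketch of this procedure in Python (my own illustration, assuming numpy; the bias is folded into w via the augmented form above):

```python
# A minimal perceptron sketch (illustration only, not the lecture's code),
# using the update w <- w + alpha * y_i * x_i for misclassified points.
# Points are augmented with a constant 1 so the bias is folded into w.
import numpy as np

def perceptron(X, y, alpha=1.0, max_epochs=100):
    """X: (N, d) data, y: (N,) labels in {-1, +1}. Returns augmented weight vector."""
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])   # x_i -> (x_i, 1)
    w = np.zeros(Xa.shape[1])                        # initialize w = 0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(Xa, y):                    # cycle through the data
            if yi * (w @ xi) <= 0:                   # misclassified (or on the boundary)
                w += alpha * yi * xi                 # perceptron update
                errors += 1
        if errors == 0:                              # converged: all points correct
            break
    return w
```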

For example in 2D

[Figure: weight vector w before and after an update on a misclassified point x_i]

NB: after convergence, w = Σ_i α_i y_i x_i, a weighted sum of the training points

Perceptron example

[Figure: perceptron solution on a 2D two-class dataset; the separating line passes close to the training points]

• if the data is linearly separable, then the algorithm will converge
• convergence can be slow …
• the separating line can pass close to the training data
• we would prefer a larger margin for generalization

What is the best w?

• maximum margin solution: most stable under perturbations of the inputs

Support Vector Machine (linearly separable data)

[Figure: maximum-margin hyperplane w^T x + b = 0 with normal w, offset b/||w|| from the origin, and the support vectors lying on the margin]

    f(x) = Σ_i α_i y_i (x_i^T x) + b        (sum over the support vectors)

SVM – sketch derivation

• Since w^T x + b = 0 and c(w^T x + b) = 0 define the same plane, we have the freedom to choose the normalization of w

• Choose the normalization such that w^T x_+ + b = +1 and w^T x_− + b = −1 for the positive and negative support vectors respectively

• Then the margin is given by

    (w / ||w||)^T (x_+ − x_−) = w^T (x_+ − x_−) / ||w|| = 2 / ||w||
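Written out step by step (a restatement of the sketch above, in LaTeX; no new symbols are introduced):

```latex
% Margin between the two support hyperplanes, expanded from the sketch above.
% Requires amsmath.
\begin{align*}
  \text{For the support vectors:}\quad
      & w^\top x_+ + b = +1, \qquad w^\top x_- + b = -1 \\
  \text{Subtracting:}\quad
      & w^\top (x_+ - x_-) = 2 \\
  \text{Projecting } (x_+ - x_-) \text{ onto the unit normal } \tfrac{w}{\|w\|}:\quad
      & \left(\tfrac{w}{\|w\|}\right)^{\!\top} (x_+ - x_-) = \frac{2}{\|w\|}
\end{align*}
```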

Support Vector Machine (linearly separable data)

[Figure: the hyperplanes w^T x + b = +1, 0, −1; the support vectors lie on the outer two, and the margin is 2/||w||]

Margin = 2/||w||

SVM – Optimization

• Learning the SVM can be formulated as an optimization:

    max_w  2/||w||   subject to   w^T x_i + b ≥ +1  if y_i = +1     for i = 1 ... N
                                  w^T x_i + b ≤ −1  if y_i = −1

• Or equivalently

    min_w  ||w||^2   subject to   y_i (w^T x_i + b) ≥ 1   for i = 1 ... N

• This is a quadratic optimization problem subject to linear constraints, and there is a unique minimum
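As a hedged illustration (not part of the lecture, whose demo is in MATLAB), the hard-margin problem can be approximated with scikit-learn's linear-kernel SVC by making the soft-margin penalty C very large; the toy data below is invented:

```python
# Sketch: solving an (approximately) hard-margin SVM on separable toy data
# with scikit-learn, which is an assumption here, not the lecture's demo.
import numpy as np
from sklearn.svm import SVC

# toy linearly separable data
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],          # class +1
              [-2.0, -1.0], [-3.0, -2.0], [-1.5, -2.5]])   # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel='linear', C=1e6)   # very large C ~ hard margin
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin =", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```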

Linear separability again: What is the best w?

• the points can be linearly separated but there is a very narrow margin

• but possibly the large margin solution is better, even though one constraint is violated

In general there is a trade-off between the margin and the number of mistakes on the training data

Introduce “slack” variables

[Figure: the hyperplanes w^T x + b = +1, 0, −1 with margin 2/||w||; slack distances ξ_i/||w|| are marked for a margin violation and for a misclassified point; points on or outside the margin have ξ = 0, and ξ_i ≥ 0]

• for 0 < ξ_i ≤ 1 the point is between the margin and the correct side of the hyperplane. This is a margin violation
• for ξ_i > 1 the point is misclassified

“Soft” margin solution

The optimization problem becomes

    min_{w ∈ R^d, ξ_i ∈ R+}  ||w||^2 + C Σ_i^N ξ_i

    subject to  y_i (w^T x_i + b) ≥ 1 − ξ_i   for i = 1 ... N

• Every constraint can be satisfied if ξ_i is sufficiently large
• C is a regularization parameter:
  - small C allows constraints to be easily ignored → large margin
  - large C makes constraints hard to ignore → narrow margin
  - C = ∞ enforces all constraints: hard margin
• This is still a quadratic optimization problem and there is a unique minimum. Note, there is only one parameter, C.
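A small sketch (again assuming scikit-learn, with invented toy data) that recovers the slack variables ξ_i = max(0, 1 − y_i f(x_i)) from a fitted soft-margin SVM and shows how margin and total slack trade off as C varies:

```python
# Sketch: recovering the slack variables xi_i = max(0, 1 - y_i f(x_i))
# from a fitted soft-margin linear SVM (scikit-learn assumed, not from the lecture).
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 0.5], [0.2, 0.3],          # class +1 (one point near the boundary)
              [-1.0, -1.0], [-2.0, -0.5], [-0.1, -0.4]])   # class -1
y = np.array([1, 1, 1, -1, -1, -1])

for C in (0.1, 10.0):                       # small C: wide margin, large C: narrow margin
    clf = SVC(kernel='linear', C=C).fit(X, y)
    f = clf.decision_function(X)            # f(x_i) = w^T x_i + b
    xi = np.maximum(0, 1 - y * f)           # slack for each training point
    margin = 2 / np.linalg.norm(clf.coef_[0])
    print(f"C={C}: margin={margin:.2f}, total slack={xi.sum():.2f}")
```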

[Figure: 2D data (feature x vs feature y) that is linearly separable, but only with a narrow margin; the hard-margin solution (C = ∞) is compared with a soft-margin solution (C = 10)]

• data is linearly separable
• but only with a narrow margin

Application: Pedestrian detection in Computer Vision

Objective: detect (localize) standing humans in an image

• cf. face detection with a sliding window classifier

• reduces object detection to binary classification

• does an image window contain a person or not?

Method: the HOG detector

Training data and features

• Positive data – 1208 positive window examples

• Negative data – 1218 negative window examples (initially)

Feature: histogram of oriented gradients (HOG)

[Figure: dominant image direction captured by the HOG orientation histogram]

• tile the window into 8 x 8 pixel cells
• each cell represented by a HOG (the frequency of each gradient orientation)

Feature vector dimension = 16 x 8 (for tiling) x 8 (orientations) = 1024
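For a concrete (hypothetical) illustration, a similar descriptor can be computed with scikit-image's hog function; the parameter choices below are picked to match the dimensions quoted above, not taken from the actual detector:

```python
# Sketch: computing a HOG descriptor for a 128 x 64 detection window with
# scikit-image (an assumption; the lecture's detector follows Dalal & Triggs).
import numpy as np
from skimage.feature import hog

window = np.random.rand(128, 64)            # stand-in for a grey-level image window

features = hog(window,
               orientations=8,              # 8 orientation bins per cell
               pixels_per_cell=(8, 8),      # 8 x 8 pixel cells -> 16 x 8 tiling
               cells_per_block=(1, 1),      # no block grouping, so 16 * 8 * 8 = 1024
               feature_vector=True)
print(features.shape)                        # (1024,)
```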

[Figure: averaged positive (pedestrian) training examples]

Algorithm

Training (Learning)

• Represent each example window by a HOG feature vector

    x_i ∈ R^d, with d = 1024

• Train an SVM classifier

Testing (Detection)

• Sliding window classifier

    f(x) = w^T x + b

Dalal and Triggs, CVPR 2005
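A rough sketch of the sliding-window idea (my own illustration; the real detector also scans over image scale and applies non-maximum suppression, omitted here, and extract_hog, w and b are placeholders standing in for a trained detector):

```python
# Sketch of sliding-window detection: score every window with f(x) = w^T x + b.
import numpy as np

def sliding_window_detect(image, w, b, extract_hog,
                          win_h=128, win_w=64, stride=8, threshold=0.0):
    """Return (row, col, score) for windows whose SVM score exceeds the threshold."""
    detections = []
    H, W = image.shape[:2]
    for r in range(0, H - win_h + 1, stride):
        for c in range(0, W - win_w + 1, stride):
            x = extract_hog(image[r:r + win_h, c:c + win_w])   # HOG feature vector
            score = w @ x + b                                   # linear SVM score
            if score > threshold:
                detections.append((r, c, score))
    return detections
```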

Learned model

    f(x) = w^T x + b

[Figure: visualization of the learned weight vector w over the detection window]

Slide from Deva Ramanan

Optimization

Learning an SVM has been formulated as a constrained optimization problem over w and ξ:

    min_{w ∈ R^d, ξ_i ∈ R+}  ||w||^2 + C Σ_i^N ξ_i   subject to   y_i (w^T x_i + b) ≥ 1 − ξ_i   for i = 1 ... N

The constraint y_i (w^T x_i + b) ≥ 1 − ξ_i can be written more concisely as

    y_i f(x_i) ≥ 1 − ξ_i

which, together with ξ_i ≥ 0, is equivalent to

    ξ_i = max(0, 1 − y_i f(x_i))

Hence the learning problem is equivalent to the unconstrained optimization problem over w:

    min_{w ∈ R^d}  ||w||^2 + C Σ_i^N max(0, 1 − y_i f(x_i))

where the first term acts as the regularization and the second as the loss.
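The unconstrained objective is straightforward to evaluate directly; a short numpy sketch (illustration only, with hand-picked toy values):

```python
# Sketch: evaluating the unconstrained SVM objective
#   ||w||^2 + C * sum_i max(0, 1 - y_i f(x_i)),  with f(x) = w^T x + b.
import numpy as np

def svm_objective(w, b, X, y, C):
    f = X @ w + b                               # f(x_i) for every training point
    hinge = np.maximum(0, 1 - y * f)            # hinge loss per point
    return np.dot(w, w) + C * hinge.sum()

# toy check with hand-picked values
w = np.array([1.0, -1.0]); b = 0.0
X = np.array([[2.0, 0.0], [0.0, 2.0]]); y = np.array([1, -1])
print(svm_objective(w, b, X, y, C=1.0))         # 2.0 + 1.0*(0 + 0) = 2.0
```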

Loss function

    min_{w ∈ R^d}  ||w||^2 + C Σ_i^N max(0, 1 − y_i f(x_i))

[Figure: hyperplane w^T x + b = 0 with its support vectors, illustrating which points contribute to the loss]

Points are in three categories:

1. y_i f(x_i) > 1    Point is outside the margin. No contribution to the loss
2. y_i f(x_i) = 1    Point is on the margin. No contribution to the loss. As in the hard-margin case.
3. y_i f(x_i) < 1    Point violates the margin constraint. Contributes to the loss

Loss functions

[Figure: loss functions plotted against y_i f(x_i)]

• SVM uses the “hinge” loss  max(0, 1 − y_i f(x_i))
• an approximation to the 0-1 loss

Optimization continued

    min_{w ∈ R^d}  C Σ_i^N max(0, 1 − y_i f(x_i)) + ||w||^2

[Figure: a cost function with both a local minimum and a global minimum]

• Does this cost function have a unique solution?
• Does the solution depend on the starting point of an iterative optimization algorithm (such as gradient descent)?

If the cost function is convex, then a locally optimal point is globally optimal (provided the optimization is over a convex set, which it is in our case)

Convex functions

Convex function examples

[Figure: examples of convex and non-convex functions]

A non-negative sum of convex functions is convex

SVM:

    min_{w ∈ R^d}  C Σ_i^N max(0, 1 − y_i f(x_i)) + ||w||^2     → convex

Gradient (or steepest) descent algorithm for SVM

To minimize a cost function C(w), use the iterative update

    w_{t+1} ← w_t − η_t ∇_w C(w_t)

where η is the learning rate.

First, rewrite the optimization problem as an average:

    min_w C(w) = (λ/2) ||w||^2 + (1/N) Σ_i^N max(0, 1 − y_i f(x_i))
               = (1/N) Σ_i^N [ (λ/2) ||w||^2 + max(0, 1 − y_i f(x_i)) ]

(with λ = 2/(NC), up to an overall scale of the problem), and f(x) = w^T x + b

Because the hinge loss is not differentiable, a sub-gradient is computed

Sub-gradient for hinge loss

    L(x_i, y_i; w) = max(0, 1 − y_i f(x_i)),     f(x_i) = w^T x_i + b

    ∂L/∂w = −y_i x_i    if y_i f(x_i) < 1

    ∂L/∂w = 0           otherwise
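In code, the sub-gradient is a simple case split (a sketch only, with the bias folded into w via an augmented input):

```python
# Sketch: sub-gradient of the hinge loss L(x_i, y_i; w) = max(0, 1 - y_i f(x_i))
# with respect to w (bias folded into w via an augmented x). Illustration only.
import numpy as np

def hinge_subgradient(w, xi, yi):
    """Return a sub-gradient of the hinge loss at (xi, yi)."""
    if yi * (w @ xi) < 1:        # margin violated (or point misclassified)
        return -yi * xi
    return np.zeros_like(w)      # the loss is flat (zero) here
```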

[Figure: hinge loss plotted against y_i f(x_i)]

Sub-gradient descent algorithm for SVM

    C(w) = (1/N) Σ_i^N [ (λ/2) ||w||^2 + L(x_i, y_i; w) ]

The iterative update is

    w_{t+1} ← w_t − η ∇_w C(w_t)
            ← w_t − η (1/N) Σ_i^N ( λ w_t + ∇_w L(x_i, y_i; w_t) )

where η is the learning rate.

Then each iteration t involves cycling through the training data with the updates:

    w_{t+1} ← w_t − η (λ w_t − y_i x_i)    if y_i f(x_i) < 1
    w_{t+1} ← w_t − η λ w_t                otherwise

In the Pegasos algorithm the learning rate is set at η_t = 1/(λt)

Pegasos – Stochastic Gradient Descent Algorithm

Randomly sample from the training data
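A compact sketch of the resulting algorithm (my own illustration following the update rules above; the function name and default values are invented, and this is not the reference implementation):

```python
# Sketch of the Pegasos stochastic sub-gradient algorithm for the SVM.
# The bias is folded into w via inputs augmented with a constant 1.
import numpy as np

def pegasos(X, y, lam=0.01, n_iters=10000, seed=0):
    """X: (N, d) augmented data, y: labels in {-1, +1}. Returns the weight vector."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for t in range(1, n_iters + 1):
        eta = 1.0 / (lam * t)                 # Pegasos learning rate eta_t = 1/(lambda t)
        i = rng.integers(N)                   # randomly sample a training point
        if y[i] * (w @ X[i]) < 1:             # margin violated: hinge term is active
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:                                 # only the regularizer contributes
            w = (1 - eta * lam) * w
    return w
```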

[Figure: Pegasos on 2D toy data; the energy (cost) decreases over the iterations, and the learned separating line is shown]

Background reading and more …

• Next lecture – see that the SVM can be expressed as a sum over the support vectors:

    f(x) = Σ_i α_i y_i (x_i^T x) + b        (sum over the support vectors)

• On web page: http://www.robots.ox.ac.uk/~az/lectures/ml

• links to SVM tutorials and video lectures

• MATLAB SVM demo