
Lecture 4: More classifiers and classes

C4B Hilary 2011 A. Zisserman

• Loss functions revisited

• AdaBoost

• Loss functions revisited

• Optimization

• Multiple class classification

Logistic Regression Overview

• Logistic regression is actually a classification method

• LR introduces an extra non-linearity over a linear classifier, f(x) = w^⊤x + b, by using a logistic (or sigmoid) function, σ(·)

• The LR classifier is defined as
$$\sigma\big(f(\mathbf{x}_i)\big)\;\begin{cases} \geq 0.5 & \Rightarrow\; y_i = +1 \\ < 0.5 & \Rightarrow\; y_i = -1 \end{cases} \qquad \text{where} \quad \sigma\big(f(\mathbf{x})\big) = \frac{1}{1 + e^{-f(\mathbf{x})}}$$

The logistic function or sigmoid

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

[Figure: the sigmoid σ(z) plotted for z from −20 to 20]

• As z goes from −∞ to ∞, σ(z) goes from 0 to 1: a “squashing function”

• It has a “sigmoid” shape (i.e. an S-like shape)
• σ(0) = 0.5, and if z = w^⊤x + b then ||dσ(z)/dx||_{z=0} = (1/4)||w||

Intuition – why use a sigmoid?

Here, choose binary classification to be represented by y_i ∈ {0, 1}, rather than y_i ∈ {−1, 1}.

Least squares fit

[Figure: 1D data with labels y ∈ {0, 1}; a least-squares fit of σ(wx + b) to y compared with a least-squares fit of wx + b to y]

• the fit of wx + b is dominated by the more distant points (see the numerical sketch below)
• this causes misclassification
• instead, LR regresses the sigmoid to the class
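As a small illustration of the bullets above, the following numpy sketch (mine, not from the lecture; the data values are made up) fits wx + b by least squares with and without distant positive points and then thresholds at 0.5:

```python
# A minimal sketch (not from the lecture notes): a least-squares fit of
# w*x + b to 0/1 labels is pulled by distant, correctly labelled points,
# so thresholding it at 0.5 misclassifies points near the boundary.
import numpy as np

x_neg = np.array([-2.0, -1.5, -1.0, -0.5, 0.0])   # class 0
x_pos = np.array([1.0, 1.5, 2.0, 2.5, 3.0])        # class 1
x_far = np.array([15.0, 18.0, 20.0])               # extra class-1 points, far away

def ls_fit_and_errors(x, y):
    """Least-squares fit of y ~ w*x + b, then classify with the 0.5 threshold."""
    w, b = np.polyfit(x, y, deg=1)
    errors = int(((w * x + b >= 0.5).astype(int) != y).sum())
    return w, b, errors

# Without the distant points: the 0.5 crossing separates the classes
x1 = np.concatenate([x_neg, x_pos]); y1 = np.concatenate([np.zeros(5), np.ones(5)])
print("without distant points:", ls_fit_and_errors(x1, y1))

# With the distant points: the crossing shifts right and misclassifies positives
x2 = np.concatenate([x_neg, x_pos, x_far]); y2 = np.concatenate([np.zeros(5), np.ones(8)])
print("with distant points:   ", ls_fit_and_errors(x2, y2))
```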

Similarly in 2D

[Figure: 2D comparison of the LR fit σ(w₁x₁ + w₂x₂ + b) vs the linear fit w₁x₁ + w₂x₂ + b]

Learning

In logistic regression, fit a sigmoid function to the data {x_i, y_i} by minimizing the classification errors
$$y_i - \sigma(\mathbf{w}^\top \mathbf{x}_i)$$


Margin property

A sigmoid favours a larger margin compared with a step classifier.

[Figure: two sigmoid fits illustrating the margin property]

Probabilistic interpretation

• Think of σ(f(x)) as the probability that y = 1, i.e. P(y = 1|x) = σ(f(x))
• Hence, if σ(f(x)) > 0.5 then class y = 1 is selected
• Then, after a rearrangement,
$$f(\mathbf{x}) = \log \frac{P(y=1|\mathbf{x})}{1 - P(y=1|\mathbf{x})} = \log \frac{P(y=1|\mathbf{x})}{P(y=0|\mathbf{x})}$$
which is the log odds.

Maximum Likelihood Estimation

Assume

$$p(y=1|\mathbf{x}; \mathbf{w}) = \sigma(\mathbf{w}^\top \mathbf{x})$$
$$p(y=0|\mathbf{x}; \mathbf{w}) = 1 - \sigma(\mathbf{w}^\top \mathbf{x})$$
and write this more compactly as
$$p(y|\mathbf{x}; \mathbf{w}) = \Big(\sigma(\mathbf{w}^\top \mathbf{x})\Big)^{y}\, \Big(1 - \sigma(\mathbf{w}^\top \mathbf{x})\Big)^{(1-y)}$$

Then the likelihood (assuming data independence) is

$$p(\mathbf{y}|\mathbf{x}; \mathbf{w}) \sim \prod_i^N \Big(\sigma(\mathbf{w}^\top \mathbf{x}_i)\Big)^{y_i}\, \Big(1 - \sigma(\mathbf{w}^\top \mathbf{x}_i)\Big)^{(1-y_i)}$$
and the negative log likelihood is
$$L(\mathbf{w}) = -\sum_i^N y_i \log \sigma(\mathbf{w}^\top \mathbf{x}_i) + (1 - y_i)\log\big(1 - \sigma(\mathbf{w}^\top \mathbf{x}_i)\big)$$

Logistic Regression Loss function

Use the notation y_i ∈ {−1, 1}. Then
$$P(y=1|\mathbf{x}) = \sigma\big(f(\mathbf{x})\big) = \frac{1}{1 + e^{-f(\mathbf{x})}}$$
$$P(y=-1|\mathbf{x}) = 1 - \sigma\big(f(\mathbf{x})\big) = \frac{1}{1 + e^{+f(\mathbf{x})}}$$
So in both cases
$$P(y_i|\mathbf{x}_i) = \frac{1}{1 + e^{-y_i f(\mathbf{x}_i)}}$$
Assuming independence, the likelihood is

$$\prod_i^N \frac{1}{1 + e^{-y_i f(\mathbf{x}_i)}}$$
and the negative log likelihood is
$$\sum_i^N \log\big(1 + e^{-y_i f(\mathbf{x}_i)}\big)$$
which defines the loss function.
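As a quick sanity check (not part of the original notes), the snippet below verifies numerically that this y_i ∈ {−1, 1} loss coincides, point by point, with the negative log likelihood for y_i ∈ {0, 1} derived above; the values of f(x_i) are arbitrary random numbers:

```python
# Numerical check that the {0,1}-label negative log likelihood equals the
# {-1,+1}-label loss log(1 + exp(-y_i f(x_i))) for every point.
import numpy as np

rng = np.random.default_rng(0)
f = rng.normal(size=10)                    # arbitrary values of f(x_i)
y01 = rng.integers(0, 2, size=10)          # labels in {0, 1}
ypm = 2 * y01 - 1                          # same labels mapped to {-1, +1}

sigma = 1.0 / (1.0 + np.exp(-f))
nll_01 = -(y01 * np.log(sigma) + (1 - y01) * np.log(1 - sigma))
loss_pm = np.log(1.0 + np.exp(-ypm * f))

print(np.allclose(nll_01, loss_pm))        # True
```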

Logistic Regression Learning

Learning is formulated as the optimization problem
$$\min_{\mathbf{w} \in \mathbb{R}^d} \; \underbrace{\sum_i^N \log\big(1 + e^{-y_i f(\mathbf{x}_i)}\big)}_{\text{loss function}} \; + \; \underbrace{\lambda \|\mathbf{w}\|^2}_{\text{regularization}}$$

• For correctly classified points −y_i f(x_i) is negative, and log(1 + e^{−y_i f(x_i)}) is near zero
• For incorrectly classified points −y_i f(x_i) is positive, and log(1 + e^{−y_i f(x_i)}) can be large
• Hence the optimization penalizes parameters which lead to such misclassifications

Comparison of SVM and LR cost functions

SVM:
$$\min_{\mathbf{w} \in \mathbb{R}^d} \; C \sum_i^N \max\big(0,\, 1 - y_i f(\mathbf{x}_i)\big) + \|\mathbf{w}\|^2$$
Logistic regression:
$$\min_{\mathbf{w} \in \mathbb{R}^d} \; \sum_i^N \log\big(1 + e^{-y_i f(\mathbf{x}_i)}\big) + \lambda \|\mathbf{w}\|^2$$

Note (comparing the losses as a function of y_i f(x_i)):
• both approximate the 0-1 loss
• very similar asymptotic behaviour
• main difference is smoothness, and non-zero values outside the margin

• SVM gives a sparse solution for the α_i
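The following short tabulation (illustrative only; the margin values are arbitrary) makes the comparison concrete by evaluating the 0-1, hinge and logistic losses at a few values of z = y_i f(x_i):

```python
# Tabulate the 0-1 loss, hinge loss and logistic loss as a function of the
# margin z = y_i f(x_i).
import numpy as np

z = np.array([-2.0, -1.0, 0.0, 0.5, 1.0, 2.0, 4.0])   # y_i f(x_i)
zero_one = (z < 0).astype(float)
hinge    = np.maximum(0.0, 1.0 - z)
logistic = np.log(1.0 + np.exp(-z))

for zi, a, b, c in zip(z, zero_one, hinge, logistic):
    print(f"z={zi:5.1f}  0-1={a:.0f}  hinge={b:.2f}  logistic={c:.3f}")
# The hinge loss is exactly zero for z >= 1; the logistic loss is smooth and
# stays slightly positive there, but both grow roughly linearly for z < 0.
```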

AdaBoost Overview

• AdaBoost is an algorithm for constructing a strong classifier out of a linear combination
$$\sum_{t=1}^{T} \alpha_t h_t(\mathbf{x})$$
of simple weak classifiers h_t(x). It provides a method of choosing the weak classifiers and setting the weights α_t.

Terminology

• weak classifier: h_t(x) ∈ {−1, 1}, for data vector x
• strong classifier:
$$H(\mathbf{x}) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(\mathbf{x})\right)$$

Example: combination of linear classifiers h_t(x) ∈ {−1, 1}

[Figure: weak classifier 1, h₁(x); weak classifier 2, h₂(x); weak classifier 3, h₃(x); and the resulting strong classifier H(x)]

$$H(\mathbf{x}) = \mathrm{sign}\big(\alpha_1 h_1(\mathbf{x}) + \alpha_2 h_2(\mathbf{x}) + \alpha_3 h_3(\mathbf{x})\big)$$

• Note, this linear combination is not a simple majority vote (it would be if all the α_t were equal)
• Need to compute the α_t as well as selecting the weak classifiers

AdaBoost algorithm – building a strong classifier

Start with equal weights on each x_i, and a set of weak classifiers h_t(x).

For t = 1, ..., T
• Select the weak classifier with minimum error
$$\epsilon_t = \sum_i \omega_i\, [h_t(\mathbf{x}_i) \neq y_i] \quad \text{where the } \omega_i \text{ are weights}$$
• Set
$$\alpha_t = \frac{1}{2}\ln\frac{1 - \epsilon_t}{\epsilon_t}$$
• Reweight the examples (boosting) to give misclassified examples more weight

$$\omega_{t+1,i} = \omega_{t,i}\, e^{-\alpha_t y_i h_t(\mathbf{x}_i)}$$
• Add the weak classifier with weight α_t
$$H(\mathbf{x}) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(\mathbf{x})\right)$$

Example

Weak classifier 1 – start with equal weights on each data point

$$\epsilon_j = \sum_i \omega_i\, [h_j(\mathbf{x}_i) \neq y_i]$$

[Figure: the weights of the misclassified points are increased]

Weak Classifier 2

Weak classifier 3

The final classifier is a linear combination of the weak classifiers.

The AdaBoost algorithm (Freund & Schapire 1995)

• Given example data (x₁, y₁), ..., (x_n, y_n), where y_i = −1, 1 for negative and positive examples respectively.

• Initialize weights ω_{1,i} = 1/(2m), 1/(2l) for y_i = −1, 1 respectively, where m and l are the number of negatives and positives respectively.

• For t = 1, ..., T
1. Normalize the weights,
$$\omega_{t,i} \leftarrow \frac{\omega_{t,i}}{\sum_{j=1}^{n} \omega_{t,j}}$$
so that ω_{t,i} is a probability distribution.
2. For each j, train a weak classifier h_j with error evaluated with respect to ω_{t,i},

$$\epsilon_j = \sum_i \omega_{t,i}\, [h_j(\mathbf{x}_i) \neq y_i]$$
3. Choose the classifier, h_t, with the lowest error ε_t.

4. Set α_t as
$$\alpha_t = \frac{1}{2}\ln\frac{1 - \epsilon_t}{\epsilon_t}$$
5. Update the weights

$$\omega_{t+1,i} = \omega_{t,i}\, e^{-\alpha_t y_i h_t(\mathbf{x}_i)}$$

• The final strong classifier is
$$H(\mathbf{x}) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(\mathbf{x})\right)$$
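A compact sketch of this algorithm is given below. It is my own illustration: axis-aligned decision stumps are used as the weak classifier family (the notes do not fix a particular family), and the toy 2-D data set is made up.

```python
# A compact AdaBoost sketch with decision stumps as weak classifiers.
# Labels are in {-1, +1}.
import numpy as np

def stump_predict(X, feat, thresh, sign):
    """Weak classifier: sign * (+1 if X[:, feat] > thresh else -1)."""
    return sign * np.where(X[:, feat] > thresh, 1.0, -1.0)

def adaboost_train(X, y, T=20):
    n, d = X.shape
    w = np.full(n, 1.0 / n)                        # start with equal weights
    strong = []                                     # list of (alpha, feat, thresh, sign)
    for _ in range(T):
        # steps 1-3: pick the stump with the lowest weighted error
        best = None
        for feat in range(d):
            for thresh in np.unique(X[:, feat]):
                for sign in (+1.0, -1.0):
                    pred = stump_predict(X, feat, thresh, sign)
                    err = np.sum(w * (pred != y))
                    if best is None or err < best[0]:
                        best = (err, feat, thresh, sign, pred)
        err, feat, thresh, sign, pred = best
        err = np.clip(err, 1e-10, 1 - 1e-10)        # avoid log(0)
        # step 4: weight of this weak classifier
        alpha = 0.5 * np.log((1 - err) / err)
        # step 5: reweight, giving misclassified examples more weight, then normalize
        w = w * np.exp(-alpha * y * pred)
        w /= w.sum()
        strong.append((alpha, feat, thresh, sign))
    return strong

def adaboost_predict(strong, X):
    score = sum(a * stump_predict(X, f, t, s) for a, f, t, s in strong)
    return np.sign(score)

# Toy 2-D problem: class +1 in one quadrant, class -1 elsewhere
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where((X[:, 0] > 0) & (X[:, 1] > 0), 1.0, -1.0)
H = adaboost_train(X, y, T=20)
print("training accuracy:", np.mean(adaboost_predict(H, X) == y))
```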

Why does it work?

The AdaBoost algorithm carries out a greedy optimization of a loss function.

AdaBoost:
$$\min_{\alpha_i,\, h_i} \sum_i^N e^{-y_i H(\mathbf{x}_i)}$$

SVM loss function:
$$\sum_i^N \max\big(0,\, 1 - y_i f(\mathbf{x}_i)\big)$$
Logistic regression loss function:
$$\sum_i^N \log\big(1 + e^{-y_i f(\mathbf{x}_i)}\big)$$

[Figure: the AdaBoost exponential loss, the LR loss and the SVM hinge loss plotted against y_i f(x_i)]

Sketch derivation – non-examinable

The objective function used by AdaBoost is

$$J(H) = \sum_i e^{-y_i H(\mathbf{x}_i)}$$
For a correctly classified point the penalty is exp(−|H|) and for an incorrectly classified point the penalty is exp(+|H|). The AdaBoost algorithm incrementally decreases the cost by adding simple functions to

$$H(\mathbf{x}) = \sum_t \alpha_t h_t(\mathbf{x})$$
Suppose that we have a function B and we propose to add the function αh(x), where the scalar α is to be determined and h(x) is some function that takes values in +1 or −1 only. The new function is B(x) + αh(x) and the new cost is
$$J(B + \alpha h) = \sum_i e^{-y_i B(\mathbf{x}_i)}\, e^{-\alpha y_i h(\mathbf{x}_i)}$$
Differentiating with respect to α and setting the result to zero gives

$$-e^{-\alpha} \sum_{y_i = h(\mathbf{x}_i)} e^{-y_i B(\mathbf{x}_i)} \; + \; e^{+\alpha} \sum_{y_i \neq h(\mathbf{x}_i)} e^{-y_i B(\mathbf{x}_i)} = 0$$
Rearranging, the optimal value of α is therefore determined to be

$$\alpha = \frac{1}{2}\log\frac{\sum_{y_i = h(\mathbf{x}_i)} e^{-y_i B(\mathbf{x}_i)}}{\sum_{y_i \neq h(\mathbf{x}_i)} e^{-y_i B(\mathbf{x}_i)}}$$

The classification error is defined as

$$\epsilon = \sum_i \omega_i\, [h(\mathbf{x}_i) \neq y_i]$$
where
$$\omega_i = \frac{e^{-y_i B(\mathbf{x}_i)}}{\sum_j e^{-y_j B(\mathbf{x}_j)}}$$
Then, it can be shown that
$$\alpha = \frac{1}{2}\log\frac{1 - \epsilon}{\epsilon}$$
The update from B to H therefore involves evaluating the weighted performance ε of the “weak” classifier h (with the weights ω_i given above).

If the current function B is B(x) = 0 then the weights will be uniform. This is a common starting point for the minimization.

As a numerical convenience, note that at the next round of boosting the required weights are obtained by multiplying the old weights by exp(−α y_i h(x_i)) and then normalizing. This gives the update formula
$$\omega_{t+1,i} = \frac{1}{Z_t}\, \omega_{t,i}\, e^{-\alpha_t y_i h_t(\mathbf{x}_i)}$$
where Z_t is a normalizing factor.

Choosing h

The function h is not chosen arbitrarily, but is chosen to give a good performance (a low value of ε) on the training data weighted by ω.
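The following numerical check (mine, not in the notes) confirms that the explicit expression for α above and the (1 − ε)/ε form agree, using random values for y_i and B(x_i) and a random candidate h:

```python
# Check that the two expressions for alpha coincide.
import numpy as np

rng = np.random.default_rng(2)
n = 50
y = rng.choice([-1.0, 1.0], size=n)
B = rng.normal(size=n)                      # current strong classifier values B(x_i)
h = rng.choice([-1.0, 1.0], size=n)         # a candidate weak classifier h(x_i)

u = np.exp(-y * B)                          # unnormalized weights e^{-y_i B(x_i)}
alpha_direct = 0.5 * np.log(u[y == h].sum() / u[y != h].sum())

w = u / u.sum()                             # normalized weights omega_i
eps = w[h != y].sum()                       # weighted error of h
alpha_from_eps = 0.5 * np.log((1 - eps) / eps)

print(np.isclose(alpha_direct, alpha_from_eps))   # True
```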

Optimization

We have seen many cost functions, e.g.

SVM:
$$\min_{\mathbf{w} \in \mathbb{R}^d} \; C \sum_i^N \max\big(0,\, 1 - y_i f(\mathbf{x}_i)\big) + \|\mathbf{w}\|^2$$

Logistic regression:
$$\min_{\mathbf{w} \in \mathbb{R}^d} \; \sum_i^N \log\big(1 + e^{-y_i f(\mathbf{x}_i)}\big) + \lambda \|\mathbf{w}\|^2$$

[Figure: a non-convex function with a local minimum and a global minimum]

• Do these have a unique solution?
• Does the solution depend on the starting point of an iterative optimization algorithm (such as gradient descent)?

If the cost function is convex, then a locally optimal point is globally optimal (provided the optimization is over a convex set, which it is in our case).

Convex functions

Convex function examples

[Figure: an example of a convex function and a non-convex function]

A non-negative sum of convex functions is convex.

Logistic regression:
$$\min_{\mathbf{w} \in \mathbb{R}^d} \; \sum_i^N \log\big(1 + e^{-y_i f(\mathbf{x}_i)}\big) + \lambda \|\mathbf{w}\|^2 \qquad \text{— convex}$$

SVM:
$$\min_{\mathbf{w} \in \mathbb{R}^d} \; C \sum_i^N \max\big(0,\, 1 - y_i f(\mathbf{x}_i)\big) + \|\mathbf{w}\|^2 \qquad \text{— convex}$$

Gradient (or Steepest) descent algorithms

To minimize a cost function C(w), use the iterative update
$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - \eta_t \nabla_{\mathbf{w}}\, \mathcal{C}(\mathbf{w}_t)$$
where η is the learning rate.

In our case the loss function is a sum over the training data. For example, for LR,
$$\min_{\mathbf{w} \in \mathbb{R}^d} \mathcal{C}(\mathbf{w}) = \sum_i^N \log\big(1 + e^{-y_i f(\mathbf{x}_i)}\big) + \lambda \|\mathbf{w}\|^2 = \sum_i^N \mathcal{L}(\mathbf{x}_i, y_i; \mathbf{w}) + \lambda \|\mathbf{w}\|^2$$
This means that one iterative update consists of a pass through the training data with an update for each point:
$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - \eta_t \left(\sum_i^N \nabla_{\mathbf{w}}\, \mathcal{L}(\mathbf{x}_i, y_i; \mathbf{w}_t) + 2\lambda \mathbf{w}_t\right)$$
The advantage is that for large amounts of data, this can be carried out point by point.

Gradient descent algorithm for LR

Minimizing L(w) using gradient descent gives [exercise] the update rule

$$\mathbf{w} \leftarrow \mathbf{w} - \eta\,\big(\sigma(\mathbf{w}^\top \mathbf{x}_i) - y_i\big)\,\mathbf{x}_i$$
where y_i ∈ {0, 1}.

Note:
• this is similar, but not identical, to the perceptron update rule

$$\mathbf{w} \leftarrow \mathbf{w} - \eta\, \mathrm{sign}(\mathbf{w}^\top \mathbf{x}_i)\,\mathbf{x}_i$$
• there is a unique solution for w
• in practice more efficient Newton methods are used to minimize L
• there can be problems with w becoming infinite for linearly separable data
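A minimal stochastic gradient descent sketch of this update rule follows (my own illustration: the bias b is folded into w via an appended constant feature, the regularizer is omitted, and the toy data are made up):

```python
# Stochastic gradient descent for logistic regression with labels in {0, 1}.
import numpy as np

def sigmoid(z):
    z = np.clip(z, -30.0, 30.0)                      # avoid overflow for large |z|
    return 1.0 / (1.0 + np.exp(-z))

def lr_sgd(X, y, eta=0.1, epochs=200):
    """X: (n, d) data, y: labels in {0, 1}. Returns the weight vector."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])    # append 1 for the bias
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            # per-point update  w <- w - eta * (sigma(w.x) - y) * x
            w -= eta * (sigmoid(w @ xi) - yi) * xi
    return w

# Toy 2-D data: two Gaussian blobs
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1.5, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])
w = lr_sgd(X, y)
pred = (sigmoid(np.hstack([X, np.ones((100, 1))]) @ w) >= 0.5).astype(float)
print("training accuracy:", np.mean(pred == y))
```

Note that finite epochs and the clipped sigmoid also sidestep the caveat above about w growing without bound on linearly separable data.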

Gradient descent algorithm for SVM

First, rewrite the optimization problem as an average:
$$\min_{\mathbf{w}} \mathcal{C}(\mathbf{w}) = \frac{\lambda}{2}\|\mathbf{w}\|^2 + \frac{1}{N}\sum_i^N \max\big(0,\, 1 - y_i f(\mathbf{x}_i)\big)$$
$$= \frac{1}{N}\sum_i^N \left(\frac{\lambda}{2}\|\mathbf{w}\|^2 + \max\big(0,\, 1 - y_i f(\mathbf{x}_i)\big)\right)$$
(with λ = 2/(NC), up to an overall scale of the problem) and f(x) = w^⊤x + b.

Because the hinge loss is not differentiable, a sub-gradient is computed.

Sub-gradient for hinge loss

$$\mathcal{L}(\mathbf{x}_i, y_i; \mathbf{w}) = \max\big(0,\, 1 - y_i f(\mathbf{x}_i)\big), \qquad f(\mathbf{x}_i) = \mathbf{w}^\top \mathbf{x}_i + b$$

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = -y_i \mathbf{x}_i \quad \text{if } y_i f(\mathbf{x}_i) < 1$$

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = 0 \quad \text{otherwise}$$

[Figure: the hinge loss plotted against y_i f(x_i)]

Sub-gradient descent algorithm for SVM

$$\mathcal{C}(\mathbf{w}) = \frac{1}{N}\sum_i^N \left(\frac{\lambda}{2}\|\mathbf{w}\|^2 + \mathcal{L}(\mathbf{x}_i, y_i; \mathbf{w})\right)$$

The iterative update is

$$\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t - \eta \nabla_{\mathbf{w}_t}\, \mathcal{C}(\mathbf{w}_t)$$
$$\leftarrow \mathbf{w}_t - \eta\, \frac{1}{N}\sum_i^N \big(\lambda \mathbf{w}_t + \nabla_{\mathbf{w}}\, \mathcal{L}(\mathbf{x}_i, y_i; \mathbf{w}_t)\big)$$
where η is the learning rate.

Then each iteration t involves cycling through the training data with the updates:

$$\mathbf{w}_{t+1} \leftarrow (1 - \eta\lambda)\mathbf{w}_t + \eta\, y_i \mathbf{x}_i \quad \text{if } y_i(\mathbf{w}^\top \mathbf{x}_i + b) < 1$$
$$\mathbf{w}_{t+1} \leftarrow (1 - \eta\lambda)\mathbf{w}_t \quad \text{otherwise}$$
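Below is a minimal sketch of these updates (my own illustration: a constant learning rate is used, the bias is folded into w via a constant feature, and the toy data are made up):

```python
# Sub-gradient descent for the linear SVM with labels in {-1, +1}.
import numpy as np

def svm_subgradient(X, y, lam=0.01, eta=0.01, epochs=100):
    """X: (n, d) data, y: labels in {-1, +1}. Returns the weight vector."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])     # append 1 for the bias
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) < 1:                      # inside the margin: hinge active
                w = (1 - eta * lam) * w + eta * yi * xi
            else:                                      # hinge inactive
                w = (1 - eta * lam) * w
    return w

# Toy 2-D data: two Gaussian blobs with labels in {-1, +1}
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-1.5, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
y = np.concatenate([-np.ones(50), np.ones(50)])
w = svm_subgradient(X, y)
pred = np.sign(np.hstack([X, np.ones((100, 1))]) @ w)
print("training accuracy:", np.mean(pred == y))
```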

Multi-class Classification

Multi-Class Classification – what we would like

Assign input vector x to one of K classes Ck

Goal: a decision rule that divides the input space into K decision regions separated by decision boundaries.

Reminder: K Nearest Neighbour (K-NN) Classifier

Algorithm
• For each test point, x, to be classified, find the K nearest samples in the training data
• Classify the point, x, according to the majority vote of their class labels

e.g. K = 3

• naturally applicable to multi-class case
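A minimal sketch of the K-NN rule (my own illustration; the toy data and K = 3 are arbitrary):

```python
# K nearest neighbour classification by majority vote.
import numpy as np

def knn_classify(X_train, y_train, x, K=3):
    """Classify a single test point x by majority vote of its K nearest
    training samples (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:K]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy 3-class training data in 2-D
X_train = np.array([[0, 0], [0, 1], [1, 0],     # class 0
                    [5, 5], [5, 6], [6, 5],     # class 1
                    [0, 5], [1, 5], [0, 6]])    # class 2
y_train = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

print(knn_classify(X_train, y_train, np.array([0.5, 0.5])))  # 0
print(knn_classify(X_train, y_train, np.array([5.5, 5.0])))  # 1
print(knn_classify(X_train, y_train, np.array([0.5, 5.5])))  # 2
```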

Build from binary classifiers …

• Learn: K two-class “1 vs the rest” classifiers f_k(x)

[Figure: three classes C1, C2, C3 and the three 1-vs-the-rest boundaries (1 vs 2 & 3, 2 vs 1 & 3, 3 vs 1 & 2); “?” marks the ambiguous central region]

Build from binary classifiers …

• Learn: K two-class “1 vs the rest” classifiers f_k(x)
• Classification: choose the class with the most positive score, max_k f_k(x)

[Figure: the three classes C1, C2, C3 with the 1-vs-the-rest boundaries (1 vs 2 & 3, 2 vs 1 & 3, 3 vs 1 & 2); the ambiguity is resolved by assigning each point to the class with the largest f_k(x)]
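A minimal sketch of the 1-vs-the-rest scheme (my own illustration): for brevity, each binary scorer f_k(x) = w_k^⊤x + b_k is fit by least squares on ±1 targets as a stand-in for the SVM or LR classifiers above, and classification picks the most positive score.

```python
# One-vs-the-rest multi-class classification with linear scorers.
import numpy as np

def one_vs_rest_fit(X, y, num_classes):
    """Fit one linear scorer per class on targets +1 (this class) / -1 (rest)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])          # bias via constant feature
    W = np.zeros((num_classes, Xb.shape[1]))
    for k in range(num_classes):
        t = np.where(y == k, 1.0, -1.0)                    # "1 vs the rest" targets
        W[k], *_ = np.linalg.lstsq(Xb, t, rcond=None)
    return W

def one_vs_rest_predict(W, X):
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.argmax(Xb @ W.T, axis=1)                     # class with most positive score

# Toy 3-class data in 2-D
rng = np.random.default_rng(5)
centres = np.array([[0, 0], [4, 0], [2, 4]])
X = np.vstack([rng.normal(c, 0.7, (40, 2)) for c in centres])
y = np.repeat(np.arange(3), 40)
W = one_vs_rest_fit(X, y, 3)
print("training accuracy:", np.mean(one_vs_rest_predict(W, X) == y))
```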

Application: hand written digit recognition

• Feature vectors: each image is 28 x 28 pixels. Rearrange as a 784-vector x

• Training: learn K = 10 two-class “1 vs the rest” SVM classifiers f_k(x)

• Classification: choose class with most positive score

$$f(\mathbf{x}) = \max_k f_k(\mathbf{x})$$

Example

[Figure: hand drawn digits and their classifications]

Background reading and more

• Other multiple-class classifiers (not covered here):
  • Neural networks
  • Random forests

• Bishop, chapters 4.1 – 4.3 and 14.3

• Hastie et al, chapters 10.1 – 10.6

• More on web page: http://www.robots.ox.ac.uk/~az/lectures/ml