
The Perceptron

Nuno Vasconcelos, ECE Department, UCSD

Classification

a classification problem has two types of variables
• X - vector of observations (features) in the world
• Y - state (class) of the world
e.g.
• x ∈ X ⊂ R², x = (fever, blood pressure)
• y ∈ Y = {disease, no disease}
X and Y are related by an (unknown) function f(·), y = f(x)

goal: design a classifier h: X → Y such that h(x) = f(x), ∀x

Linear discriminant

the classifier implements the linear decision rule

h*(x) = sgn[g(x)] = { +1 if g(x) > 0 ; -1 if g(x) < 0 },   with g(x) = wᵀ x + b

it has the following properties
• it divides X into two "half-spaces"
• the boundary is the plane with:
  • normal w
  • distance to the origin b/||w||
• g(x)/||w|| is the distance from point x to the boundary
• g(x) = 0 for points on the plane
• g(x) > 0 on the side w points to (the "positive side")
• g(x) < 0 on the "negative side"

Linear discriminant

the classifier implements the linear decision rule

h*(x) = sgn[g(x)] = { +1 if g(x) > 0 ; -1 if g(x) < 0 },   with g(x) = wᵀ x + b

given a linearly separable training set D = {(x_1,y_1), ..., (x_n,y_n)}
there are no errors if and only if, ∀i,
• y_i = +1 and g(x_i) > 0, or
• y_i = -1 and g(x_i) < 0
• i.e. y_i · g(x_i) > 0
this allows a very concise expression for the situation of "no training error" or "zero empirical risk"

Learning as optimization

necessary and sufficient condition for zero empirical risk

y_i (wᵀ x_i + b) > 0,  ∀i

this is interesting because it allows the formulation of the learning problem as one of function optimization
• starting from a random guess for the parameters w and b
• we maximize the reward function  ∑_{i=1}^{n} y_i (wᵀ x_i + b)
• or, equivalently, minimize the cost function

J(w,b) = −∑_{i=1}^{n} y_i (wᵀ x_i + b)
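For concreteness, here is a minimal numpy sketch of this cost; the function name, argument names, and array layout are my own illustrative choices, not from the slides:

```python
import numpy as np

def perceptron_cost(w, b, X, y):
    """Cost J(w,b) = -sum_i y_i (w^T x_i + b) over the whole training set.

    X: (n, d) array of feature vectors, y: (n,) array of labels in {-1, +1}.
    """
    margins = y * (X @ w + b)   # y_i (w^T x_i + b) for every point
    return -np.sum(margins)     # zero empirical risk iff every margin > 0
```

For a linearly separable D and a separating (w, b), every term y_i (wᵀ x_i + b) is positive, so J(w, b) < 0.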

The gradient

we have seen that the gradient of a function f(w) at z is

∇f(z) = ( ∂f/∂w_0 (z), ..., ∂f/∂w_{n-1} (z) )ᵀ

Theorem: the gradient points in the direction of maximum growth

the gradient is
• the direction of greatest increase of f at z
• normal to the iso-contours of f(·)

[figure: iso-contours of f(x,y), with the gradient vectors ∇f(x_0,y_0) and ∇f(x_1,y_1) drawn normal to them]

Critical point conditions

let f(x) be continuously differentiable
x* is a local minimum of f(x) if and only if
• f has zero gradient at x*

∇f(x*) = 0

• and the Hessian of f at x* is positive definite

dᵀ ∇²f(x*) d ≥ 0,  ∀d ∈ Rⁿ

• where ∇²f(x) is the matrix of second partial derivatives,

∇²f(x) = [ ∂²f/∂x_i ∂x_j (x) ],   i, j = 0, ..., n-1
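As a sanity check of these conditions (a toy example of my own, not from the lecture), one can verify them numerically for a quadratic f(x) = xᵀAx + cᵀx, whose gradient is 2Ax + c and whose Hessian is the constant matrix 2A:

```python
import numpy as np

# toy quadratic f(x) = x^T A x + c^T x with A positive definite
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
c = np.array([-1.0, 4.0])

x_star = np.linalg.solve(2 * A, -c)        # solves grad f(x*) = 2 A x* + c = 0

grad = 2 * A @ x_star + c                  # should be (numerically) zero
hessian = 2 * A                            # constant for a quadratic
eigenvalues = np.linalg.eigvalsh(hessian)  # all >= 0  <=>  d^T Hess d >= 0 for all d

print(grad, eigenvalues)                   # gradient ~ 0, eigenvalues all positive
```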

Gradient descent

this suggests a simple minimization technique:

• pick an initial estimate x^(0)
• follow the negative gradient

x^(n+1) = x^(n) − η ∇f(x^(n))

this is gradient descent
η is the learning rate (step size) and needs to be carefully chosen

• if η is too large, the descent may diverge

many extensions are possible

main point:
• once framed as optimization, we can (in general) solve it
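A minimal numpy sketch of this update rule; the stopping test, step count, and names are my own illustrative choices:

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, steps=1000, tol=1e-8):
    """Follow the negative gradient: x_(n+1) = x_(n) - eta * grad_f(x_(n))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:   # (near-)zero gradient: stop
            break
        x = x - eta * g
    return x

# example: minimize f(x) = ||x||^2, whose gradient is 2x
x_min = gradient_descent(lambda x: 2 * x, x0=[3.0, -2.0])
```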

The perceptron

this was the main insight of Rosenblatt, which led to the Perceptron
the basic idea is to do gradient descent on our cost

J(w,b) = −∑_{i=1}^{n} y_i (wᵀ x_i + b)

we know that:
• if the training set is linearly separable, there is at least one pair (w,b) such that J(w,b) < 0
• any minimum that is equal to or better than this will do
Q: can we find one such minimum?

Perceptron learning

the gradient is straightforward to compute

∂J/∂w = −∑_i y_i x_i        ∂J/∂b = −∑_i y_i

and gradient descent is trivial
there is, however, one problem:
• J(w,b) is not bounded below
• if J(w,b) < 0, we can make J → −∞ by multiplying w and b by λ > 0
• the minimum is always at −∞, which is quite bad numerically
this is really just the normalization problem that we already talked about

Rosenblatt's idea

restrict attention to the points that are incorrectly classified
at each iteration, define the set of errors

E = { x_i | y_i (wᵀ x_i + b) < 0 }

and make the cost

J_p(w,b) = −∑_{x_i ∈ E} y_i (wᵀ x_i + b)

note that
• J_p cannot be negative since, in E, all y_i (wᵀ x_i + b) are negative
• if we get to zero, we know we have the best possible solution (E empty)
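A small numpy sketch of J_p and its gradients over the error set; names and array layout are my own, and the gradient expressions are the ones given for J two slides back, restricted to E:

```python
import numpy as np

def perceptron_error_cost(w, b, X, y):
    """J_p(w,b): sum only over the misclassified points (the error set E).

    Returns the cost and its gradients with respect to w and b.
    X: (n, d) array, y: (n,) array of labels in {-1, +1}.
    """
    margins = y * (X @ w + b)
    E = margins < 0                            # mask of misclassified points
    Jp = -np.sum(margins[E])                   # always >= 0, zero iff E is empty
    grad_w = -(y[E, None] * X[E]).sum(axis=0)  # -sum_{i in E} y_i x_i
    grad_b = -y[E].sum()                       # -sum_{i in E} y_i
    return Jp, grad_w, grad_b
```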

Perceptron learning

learning is trivial: just do gradient descent on J_p(w,b)

w^(n+1) = w^(n) + η ∑_{x_i ∈ E} y_i x_i
b^(n+1) = b^(n) + η ∑_{x_i ∈ E} y_i

this turns out not to be very effective if D is large
• we loop over the entire training set just to take a single small step at the end
one alternative that is frequently better is "stochastic gradient descent"
• take the step immediately after each point
• there is no guarantee that this is a descent step but, on average, you follow the same direction after processing the entire D
• very popular in learning, where D is usually large

Perceptron learning

the Perceptron learning algorithm is as follows:

set k = 0, w_k = 0, b_k = 0
set R = max_i ||x_i||
do {
  for i = 1:n {
    if y_i (w_kᵀ x_i + b_k) < 0 then {
      w_{k+1} = w_k + η y_i x_i
      b_{k+1} = b_k + η y_i R²        (we will talk about R shortly!)
      k = k + 1
    }
  }
} until y_i (w_kᵀ x_i + b_k) ≥ 0, ∀i   (no errors)
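A direct Python translation of this pseudocode (a sketch, with two choices of my own: the misclassification test uses ≤ 0 so that the all-zero initialization also triggers updates, and a max_updates cap is added so the loop also terminates on non-separable data, which the convergence theorem does not cover):

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, max_updates=100000):
    """Perceptron learning as in the pseudocode above.

    X: (n, d) array, y: (n,) array of labels in {-1, +1}.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    R = np.max(np.linalg.norm(X, axis=1))    # R = max_i ||x_i||
    k = 0
    while k < max_updates:
        errors = 0
        for i in range(n):
            if y[i] * (w @ X[i] + b) <= 0:   # misclassified (or on the boundary)
                w = w + eta * y[i] * X[i]    # w_{k+1} = w_k + eta * y_i * x_i
                b = b + eta * y[i] * R**2    # b_{k+1} = b_k + eta * y_i * R^2
                k += 1
                errors += 1
        if errors == 0:                      # no errors: done
            break
    return w, b
```

On a linearly separable training set this returns a separating (w, b).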

Perceptron learning

does this make sense? consider the example below

set k = 0, w_k = 0, b_k = 0
set R = max_i ||x_i||
do {
  for i = 1:n {
    if y_i (w_kᵀ x_i + b_k) < 0 then {
      w_{k+1} = w_k + η y_i x_i
      b_{k+1} = b_k + η y_i R²
      k = k + 1
    }
  }
} until y_i (w_kᵀ x_i + b_k) ≥ 0, ∀i   (no errors)

[figure sequence: a 2D training set with classes y = 1 and y = -1 and the boundary defined by (w_k, b_k); a misclassified point x_i is found; the update w_{k+1} = w_k + η y_i x_i rotates the normal toward x_i, and the update b_{k+1} = b_k + η y_i R² shifts the boundary, so that x_i ends up on the correct side]

Perceptron learning

OK, this makes intuitive sense
how do we know it will not get stuck in a local minimum?
this was Rosenblatt's seminal contribution

Theorem: Let D = {(x_1,y_1), ..., (x_n,y_n)} and R = max_i ||x_i||   (*)
If there is (w*, b*) such that ||w*|| = 1 and

y_i (w*ᵀ x_i + b*) > γ,  ∀i   (**)

then the Perceptron will find an error-free hyper-plane in at most

(2R/γ)²  iterations

Proof

not that hard
denote the iteration by t and assume the point processed at iteration t-1 is (x_i, y_i)
for simplicity, use homogeneous coordinates. Defining

z_i = [x_i ; R]        a_t = [w_t ; b_t/R]

allows the compact notation

y_i (w_{t-1}ᵀ x_i + b_{t-1}) = y_i a_{t-1}ᵀ z_i

since only misclassified points are processed, we have

y_i a_{t-1}ᵀ z_i < 0   (***)


Proof

how does a_t evolve?

a_t = [w_t ; b_t/R] = [w_{t-1} ; b_{t-1}/R] + η y_i [x_i ; R] = a_{t-1} + η y_i z_i

denoting the optimal solution by a* = [w* ; b*/R],

a_tᵀ a* = a_{t-1}ᵀ a* + η y_i z_iᵀ a* = a_{t-1}ᵀ a* + η y_i (w*ᵀ x_i + b*)

and, from (**),

a_tᵀ a* > a_{t-1}ᵀ a* + ηγ

solving the recursion (and using a_0 = 0),

a_tᵀ a* > a_{t-1}ᵀ a* + ηγ > a_{t-2}ᵀ a* + 2ηγ > ... > t ηγ

Proof

this means convergence to a* if we can bound the magnitude of a_t. what is this magnitude? since

a_t = a_{t-1} + η y_i z_i

we have

||a_t||² = ||a_{t-1}||² + 2η y_i a_{t-1}ᵀ z_i + η² ||z_i||²

and,

||a_t||² < ||a_{t-1}||² + η² ||z_i||²          from (***)
         = ||a_{t-1}||² + η² (||x_i||² + R²)   from the definition of z_i
         ≤ ||a_{t-1}||² + 2η² R²               from (*)

solving the recursion,

||a_t||² ≤ 2tη²R²

Proof

combining the two

t ηγ < a_tᵀ a* ≤ ||a_t|| · ||a*|| ≤ ||a*|| ηR √(2t)

and

t < (2R²/γ²) ||a*||²
  = (2R²/γ²) ( ||w*||² + b*²/R² )     from the definition of a*
  = (2R²/γ²) ( 1 + b*²/R² )          from ||w*|| = 1

since b*/||w*|| = b* is the distance from the optimal boundary to the origin, we have b* < R = max_i ||x_i|| and

t < (2R/γ)²

Note

this is not the "standard proof" (e.g. Duda, Hart, Stork)
the standard proof uses:
• the regular algorithm (no R in the update equations)
• a tighter bound, t < (R/γ)²
this appears better, but requires choosing η = R²/γ, which requires knowledge of γ, which we don't have until we find a*
i.e. that proof is non-constructive; we cannot design an algorithm that way
the algorithm above just works! hence, I like this proof better despite the looser bound.

Perceptron learning

Theorem: Let D = {(x_1,y_1), ..., (x_n,y_n)} and R = max_i ||x_i||.
If there is (w*, b*) such that ||w*|| = 1 and

y_i (w*ᵀ x_i + b*) > γ,  ∀i

then the Perceptron will find an error-free hyper-plane in at most

(2R/γ)²  iterations

this result was the start of learning theory
for the first time there was a proof that a learning machine could actually learn something!

The margin

note that

y_i (w*ᵀ x_i + b*) ≥ γ,  ∀i

holds, with the largest possible γ, when

γ = min_i y_i (w*ᵀ x_i + b*) = min_i |w*ᵀ x_i + b*| / ||w*||

which is how we defined the margin
this says that the bound on the time to convergence, (2R/γ)², is inversely proportional to (the square of) the margin
even in this early result, the margin appears as a measure of the difficulty of the learning problem
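A one-line numpy sketch of this margin for a given (w, b) and training set, assuming every point is already correctly classified (the function and argument names are mine):

```python
import numpy as np

def margin(w, b, X, y):
    """gamma = min_i y_i (w^T x_i + b) / ||w||: the distance from the boundary
    to the closest training point (assumes all points are correctly classified,
    so every term is positive)."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)
```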

The role of R

scaling the space should not make a difference as to whether the problem is solvable
R accounts for this

if the x_i are re-scaled, both R and γ are re-scaled and the bound

(2R/γ)²

remains the same
once again, this is just a question of normalization
• it illustrates the fact that the normalization ||w|| = 1 is usually not sufficient

[figure: a 2D training set with classes y = 1 and y = -1, showing the margin γ around the boundary and the radius R of the data]

Some history

Rosenblatt's result generated a lot of excitement about learning in the 50s
later, Minsky and Papert identified serious problems with the Perceptron
• there are very simple logic problems that it cannot solve
• more on this in the homework
this killed off the enthusiasm until an old result by Kolmogorov saved the day

Theorem: any continuous function g(x) defined on [0,1]ⁿ can be represented in the form

g(x) = ∑_{j=1}^{2n+1} Γ_j ( ∑_{i=1}^{n} Ψ_ij(x_i) )

Some history

noting that the Perceptron can be written as

h(x) = sgn[wᵀ x + b] = sgn[ ∑_{i=1}^{n} w_i x_i + w_0 ]

this looks like having two Perceptron layers
layer 1: J hyper-planes

u_j(x) = sgn[ ∑_{i=1}^{n} w_ji x_i + w_j0 ],   j = 1, ..., J

layer 2:

h(x) = sgn[ ∑_{j=1}^{J} v_j u_j(x) + v_0 ]
     = sgn[ ∑_{j=1}^{J} v_j sgn( ∑_{i=1}^{n} w_ji x_i + w_j0 ) + v_0 ]

Some history

which can be written as

h(x) = sgn[g(x)]   with   g(x) = ∑_{j=1}^{J} v_j sgn( ∑_{i=1}^{n} w_ji x_i + w_j0 ) + v_0

and resembles

g(x) = ∑_{j=1}^{2n+1} Γ_j ( ∑_{i=1}^{n} Ψ_ij(x_i) )

it suggested the idea that
• while one Perceptron is not good enough
• maybe a multi-layered Perceptron (MLP) will work
a lot of work on MLPs ensued under the name of neural networks
eventually, it was shown that most functions can be approximated by MLPs
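A minimal numpy sketch of this two-layer form with sign units; the weights below are random placeholders of mine, just to show the shape of the computation:

```python
import numpy as np

def two_layer_perceptron(x, W, w0, v, v0):
    """h(x) = sgn( sum_j v_j * sgn(sum_i W[j,i] x_i + w0[j]) + v0 ).

    W: (J, n) first-layer weights, w0: (J,) first-layer biases,
    v: (J,) second-layer weights, v0: scalar second-layer bias.
    """
    u = np.sign(W @ x + w0)      # layer 1: J hyper-plane units u_j(x)
    return np.sign(v @ u + v0)   # layer 2: a Perceptron on the u_j

# example with random placeholder weights, n = 2 inputs, J = 3 hidden units
rng = np.random.default_rng(0)
W, w0 = rng.normal(size=(3, 2)), rng.normal(size=3)
v, v0 = rng.normal(size=3), rng.normal()
print(two_layer_perceptron(np.array([0.5, -1.0]), W, w0, v, v0))
```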

Graphical representation

the Perceptron is usually represented as

[figure: network diagram of the Perceptron]

• input units: the coordinates of x
• weights: the coordinates of w
• bias term w_0 handled with homogeneous coordinates: x = (x, 1)ᵀ

h(x) = sgn( ∑_i w_i x_i + w_0 ) = sgn(wᵀ x)

Sigmoids

the sgn[·] function is problematic in two ways:
• it has no derivative at 0
• it is non-smooth
approximations: it can be approximated in various ways, for example by the hyperbolic tangent

f(x) = tanh(σx) = (e^{σx} − e^{−σx}) / (e^{σx} + e^{−σx})

σ controls the approximation error, and tanh
• has a derivative everywhere
• is smooth
neural networks are implemented with these functions

[figure: plots of f(x) = tanh(σx) and its derivatives f'(x), f''(x)]
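A short numpy illustration of this approximation and its derivative; the derivative formula σ(1 − tanh²(σx)) is standard, and the value of σ is an arbitrary choice of mine:

```python
import numpy as np

def smooth_sign(x, sigma=4.0):
    """Smooth approximation of sgn(x): f(x) = tanh(sigma * x)."""
    return np.tanh(sigma * x)

def smooth_sign_derivative(x, sigma=4.0):
    """f'(x) = sigma * (1 - tanh(sigma * x)**2), defined everywhere."""
    return sigma * (1.0 - np.tanh(sigma * x) ** 2)

x = np.linspace(-2, 2, 5)
print(smooth_sign(x), smooth_sign_derivative(x))
```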

Neural network

the MLP as function approximation
