
The Perceptron

Nuno Vasconcelos, ECE Department, UCSD

Classification

a classification problem has two types of variables
• X - vector of observations (features) in the world
• Y - state (class) of the world
e.g.
• x ∈ X ⊂ R², x = (fever, blood pressure)
• y ∈ Y = {disease, no disease}
X and Y are related by an (unknown) function f(·), y = f(x)

goal: design a classifier h: X → Y such that h(x) = f(x), ∀x

Linear discriminant

the classifier implements the linear decision rule

h*(x) = sgn[g(x)] = { +1 if g(x) > 0 ; -1 if g(x) < 0 },   with g(x) = wᵀ x + b

it has the following properties
• it divides X into two "half-spaces"
• the boundary is the plane with:
  • normal w
  • distance to the origin b/||w||
• g(x)/||w|| is the distance from point x to the boundary
• g(x) = 0 for points on the plane
• g(x) > 0 on the side w points to (the "positive side")
• g(x) < 0 on the "negative side"

Linear discriminant

the classifier implements the linear decision rule

h*(x) = sgn[g(x)] = { +1 if g(x) > 0 ; -1 if g(x) < 0 },   with g(x) = wᵀ x + b

given a linearly separable training set D = {(x_1,y_1), ..., (x_n,y_n)}
there are no errors if and only if, ∀i,
• y_i = +1 and g(x_i) > 0, or
• y_i = -1 and g(x_i) < 0
• i.e. y_i · g(x_i) > 0
this allows a very concise expression for the situation of "no training error" or "zero empirical risk"

Learning as optimization

necessary and sufficient condition for zero empirical risk

y_i (wᵀ x_i + b) > 0,  ∀i

this is interesting because it allows the formulation of the learning problem as one of function optimization
• starting from a random guess for the parameters w and b
• we maximize the reward function  ∑_{i=1}^{n} y_i (wᵀ x_i + b)
• or, equivalently, minimize the cost function

J(w,b) = −∑_{i=1}^{n} y_i (wᵀ x_i + b)
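For concreteness, here is a minimal numpy sketch of this cost; the function name, argument names, and array layout are my own illustrative choices, not from the slides:

```python
import numpy as np

def perceptron_cost(w, b, X, y):
    """Cost J(w,b) = -sum_i y_i (w^T x_i + b) over the whole training set.

    X: (n, d) array of feature vectors, y: (n,) array of labels in {-1, +1}.
    """
    margins = y * (X @ w + b)   # y_i (w^T x_i + b) for every point
    return -np.sum(margins)     # zero empirical risk iff every margin > 0
```

For a linearly separable D and a separating (w, b), every term y_i (wᵀ x_i + b) is positive, so J(w, b) < 0.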

The gradient

we have seen that the gradient of a function f(w) at z is

∇f(z) = ( ∂f/∂w_0 (z), ..., ∂f/∂w_{n-1} (z) )ᵀ

Theorem: the gradient points in the direction of maximum growth

the gradient is
• the direction of greatest increase of f at z
• normal to the iso-contours of f(·)

[figure: iso-contours of f(x,y), with the gradient vectors ∇f(x_0,y_0) and ∇f(x_1,y_1) drawn normal to them]

Critical point conditions

let f(x) be continuously differentiable
x* is a local minimum of f(x) if and only if
• f has zero gradient at x*

∇f(x*) = 0

• and the Hessian of f at x* is positive definite

dᵀ ∇²f(x*) d ≥ 0,  ∀d ∈ Rⁿ

• where ∇²f(x) is the matrix of second partial derivatives,

∇²f(x) = [ ∂²f/∂x_i ∂x_j (x) ],   i, j = 0, ..., n-1
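As a sanity check of these conditions (a toy example of my own, not from the lecture), one can verify them numerically for a quadratic f(x) = xᵀAx + cᵀx, whose gradient is 2Ax + c and whose Hessian is the constant matrix 2A:

```python
import numpy as np

# toy quadratic f(x) = x^T A x + c^T x with A positive definite
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
c = np.array([-1.0, 4.0])

x_star = np.linalg.solve(2 * A, -c)        # solves grad f(x*) = 2 A x* + c = 0

grad = 2 * A @ x_star + c                  # should be (numerically) zero
hessian = 2 * A                            # constant for a quadratic
eigenvalues = np.linalg.eigvalsh(hessian)  # all >= 0  <=>  d^T Hess d >= 0 for all d

print(grad, eigenvalues)                   # gradient ~ 0, eigenvalues all positive
```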

Gradient descent

this suggests a simple minimization technique:

• pick an initial estimate x^(0)
• follow the negative gradient

x^(n+1) = x^(n) − η ∇f(x^(n))

this is gradient descent
η is the learning rate (step size) and needs to be carefully chosen

• if η is too large, the descent may diverge

many extensions are possible

main point:
• once framed as optimization, we can (in general) solve it
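A minimal numpy sketch of this update rule; the stopping test, step count, and names are my own illustrative choices:

```python
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, steps=1000, tol=1e-8):
    """Follow the negative gradient: x_(n+1) = x_(n) - eta * grad_f(x_(n))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:   # (near-)zero gradient: stop
            break
        x = x - eta * g
    return x

# example: minimize f(x) = ||x||^2, whose gradient is 2x
x_min = gradient_descent(lambda x: 2 * x, x0=[3.0, -2.0])
```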

The perceptron

this was the main insight of Rosenblatt, which led to the Perceptron
the basic idea is to do gradient descent on our cost

J(w,b) = −∑_{i=1}^{n} y_i (wᵀ x_i + b)

we know that:
• if the training set is linearly separable, there is at least one pair (w,b) such that J(w,b) < 0
• any minimum that is equal to or better than this will do
Q: can we find one such minimum?

Perceptron learning

the gradient is straightforward to compute

∂J/∂w = −∑_i y_i x_i        ∂J/∂b = −∑_i y_i

and gradient descent is trivial
there is, however, one problem:
• J(w,b) is not bounded below
• if J(w,b) < 0, we can make J → −∞ by multiplying w and b by λ > 0
• the minimum is always at −∞, which is quite bad numerically
this is really just the normalization problem that we already talked about

Rosenblatt's idea

restrict attention to the points that are incorrectly classified
at each iteration, define the set of errors

E = { x_i | y_i (wᵀ x_i + b) < 0 }

and make the cost

J_p(w,b) = −∑_{x_i ∈ E} y_i (wᵀ x_i + b)

note that
• J_p cannot be negative since, in E, all y_i (wᵀ x_i + b) are negative
• if we get to zero, we know we have the best possible solution (E empty)
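A small numpy sketch of J_p and its gradients over the error set; names and array layout are my own, and the gradient expressions are the ones given for J two slides back, restricted to E:

```python
import numpy as np

def perceptron_error_cost(w, b, X, y):
    """J_p(w,b): sum only over the misclassified points (the error set E).

    Returns the cost and its gradients with respect to w and b.
    X: (n, d) array, y: (n,) array of labels in {-1, +1}.
    """
    margins = y * (X @ w + b)
    E = margins < 0                            # mask of misclassified points
    Jp = -np.sum(margins[E])                   # always >= 0, zero iff E is empty
    grad_w = -(y[E, None] * X[E]).sum(axis=0)  # -sum_{i in E} y_i x_i
    grad_b = -y[E].sum()                       # -sum_{i in E} y_i
    return Jp, grad_w, grad_b
```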

Perceptron learning

learning is trivial: just do gradient descent on J_p(w,b)

w^(n+1) = w^(n) + η ∑_{x_i ∈ E} y_i x_i
b^(n+1) = b^(n) + η ∑_{x_i ∈ E} y_i

this turns out not to be very effective if D is large
• we loop over the entire training set just to take a single small step at the end
one alternative that is frequently better is "stochastic gradient descent"
• take the step immediately after each point
• there is no guarantee that this is a descent step but, on average, you follow the same direction after processing the entire D
• very popular in learning, where D is usually large

Perceptron learning

the Perceptron learning algorithm is as follows:

set k = 0, w_k = 0, b_k = 0
set R = max_i ||x_i||
do {
  for i = 1:n {
    if y_i (w_kᵀ x_i + b_k) < 0 then {
      w_{k+1} = w_k + η y_i x_i
      b_{k+1} = b_k + η y_i R²        (we will talk about R shortly!)
      k = k + 1
    }
  }
} until y_i (w_kᵀ x_i + b_k) ≥ 0, ∀i   (no errors)
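A direct Python translation of this pseudocode (a sketch, with two choices of my own: the misclassification test uses ≤ 0 so that the all-zero initialization also triggers updates, and a max_updates cap is added so the loop also terminates on non-separable data, which the convergence theorem does not cover):

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, max_updates=100000):
    """Perceptron learning as in the pseudocode above.

    X: (n, d) array, y: (n,) array of labels in {-1, +1}.
    """
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    R = np.max(np.linalg.norm(X, axis=1))    # R = max_i ||x_i||
    k = 0
    while k < max_updates:
        errors = 0
        for i in range(n):
            if y[i] * (w @ X[i] + b) <= 0:   # misclassified (or on the boundary)
                w = w + eta * y[i] * X[i]    # w_{k+1} = w_k + eta * y_i * x_i
                b = b + eta * y[i] * R**2    # b_{k+1} = b_k + eta * y_i * R^2
                k += 1
                errors += 1
        if errors == 0:                      # no errors: done
            break
    return w, b
```

On a linearly separable training set this returns a separating (w, b).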

Perceptron learning

does this make sense? consider the example below

set k = 0, w_k = 0, b_k = 0
set R = max_i ||x_i||
do {
  for i = 1:n {
    if y_i (w_kᵀ x_i + b_k) < 0 then {
      w_{k+1} = w_k + η y_i x_i
      b_{k+1} = b_k + η y_i R²
      k = k + 1
    }
  }
} until y_i (w_kᵀ x_i + b_k) ≥ 0, ∀i   (no errors)

[figure sequence: a 2D training set with classes y = 1 and y = -1 and the boundary defined by (w_k, b_k); a misclassified point x_i is found; the update w_{k+1} = w_k + η y_i x_i rotates the normal toward x_i, and the update b_{k+1} = b_k + η y_i R² shifts the boundary, so that x_i ends up on the correct side]

Perceptron learning

OK, this makes intuitive sense
how do we know it will not get stuck in a local minimum?
this was Rosenblatt's seminal contribution

Theorem: Let D = {(x_1,y_1), ..., (x_n,y_n)} and R = max_i ||x_i||   (*)
If there is (w*, b*) such that ||w*|| = 1 and

y_i (w*ᵀ x_i + b*) > γ,  ∀i   (**)

then the Perceptron will find an error-free hyper-plane in at most

(2R/γ)²  iterations

Proof

not that hard
denote the iteration by t and assume the point processed at iteration t-1 is (x_i, y_i)
for simplicity, use homogeneous coordinates. Defining

z_i = [x_i ; R]        a_t = [w_t ; b_t/R]

allows the compact notation

y_i (w_{t-1}ᵀ x_i + b_{t-1}) = y_i a_{t-1}ᵀ z_i

since only misclassified points are processed, we have

y_i a_{t-1}ᵀ z_i < 0   (***)


Proof

how does a_t evolve?

a_t = [w_t ; b_t/R] = [w_{t-1} ; b_{t-1}/R] + η y_i [x_i ; R] = a_{t-1} + η y_i z_i

denoting the optimal solution by a* = [w* ; b*/R],

a_tᵀ a* = a_{t-1}ᵀ a* + η y_i z_iᵀ a* = a_{t-1}ᵀ a* + η y_i (w*ᵀ x_i + b*)

and, from (**),

a_tᵀ a* > a_{t-1}ᵀ a* + ηγ

solving the recursion (and using a_0 = 0),

a_tᵀ a* > a_{t-1}ᵀ a* + ηγ > a_{t-2}ᵀ a* + 2ηγ > ... > t ηγ

Proof

this means convergence to a* if we can bound the magnitude of a_t. what is this magnitude? since

a_t = a_{t-1} + η y_i z_i

we have

||a_t||² = ||a_{t-1}||² + 2η y_i a_{t-1}ᵀ z_i + η² ||z_i||²

and,

||a_t||² < ||a_{t-1}||² + η² ||z_i||²          from (***)
         = ||a_{t-1}||² + η² (||x_i||² + R²)   from the definition of z_i
         ≤ ||a_{t-1}||² + 2η² R²               from (*)

solving the recursion,

||a_t||² ≤ 2tη²R²

Proof

combining the two

t ηγ < a_tᵀ a* ≤ ||a_t|| · ||a*|| ≤ ||a*|| ηR √(2t)

and

t < (2R²/γ²) ||a*||²
  = (2R²/γ²) ( ||w*||² + b*²/R² )     from the definition of a*
  = (2R²/γ²) ( 1 + b*²/R² )          from ||w*|| = 1

since b*/||w*|| = b* is the distance from the optimal boundary to the origin, we have b* < R = max_i ||x_i|| and

t < (2R/γ)²

Note

this is not the "standard proof" (e.g. Duda, Hart, Stork)
the standard proof uses:
• the regular algorithm (no R in the update equations)
• a tighter bound, t < (R/γ)²
this appears better, but requires choosing η = R²/γ, which requires knowledge of γ, which we don't have until we find a*
i.e. that proof is non-constructive; we cannot design an algorithm that way
the algorithm above just works! hence, I like this proof better despite the looser bound.

Perceptron learning

Theorem: Let D = {(x_1,y_1), ..., (x_n,y_n)} and R = max_i ||x_i||.
If there is (w*, b*) such that ||w*|| = 1 and

y_i (w*ᵀ x_i + b*) > γ,  ∀i

then the Perceptron will find an error-free hyper-plane in at most

(2R/γ)²  iterations

this result was the start of learning theory
for the first time there was a proof that a learning machine could actually learn something!

The margin

note that

y_i (w*ᵀ x_i + b*) ≥ γ,  ∀i

holds, with the largest possible γ, when

γ = min_i y_i (w*ᵀ x_i + b*) = min_i |w*ᵀ x_i + b*| / ||w*||

which is how we defined the margin
this says that the bound on the time to convergence, (2R/γ)², is inversely proportional to (the square of) the margin
even in this early result, the margin appears as a measure of the difficulty of the learning problem
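A one-line numpy sketch of this margin for a given (w, b) and training set, assuming every point is already correctly classified (the function and argument names are mine):

```python
import numpy as np

def margin(w, b, X, y):
    """gamma = min_i y_i (w^T x_i + b) / ||w||: the distance from the boundary
    to the closest training point (assumes all points are correctly classified,
    so every term is positive)."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)
```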

The role of R

scaling the space should not make a difference as to whether the problem is solvable
R accounts for this

if the x_i are re-scaled, both R and γ are re-scaled and the bound

(2R/γ)²

remains the same
once again, this is just a question of normalization
• it illustrates the fact that the normalization ||w|| = 1 is usually not sufficient

[figure: a 2D training set with classes y = 1 and y = -1, showing the margin γ around the boundary and the radius R of the data]

Some history

Rosenblatt's result generated a lot of excitement about learning in the 50s
later, Minsky and Papert identified serious problems with the Perceptron
• there are very simple logic problems that it cannot solve
• more on this in the homework
this killed off the enthusiasm until an old result by Kolmogorov saved the day

Theorem: any continuous function g(x) defined on [0,1]ⁿ can be represented in the form

g(x) = ∑_{j=1}^{2n+1} Γ_j ( ∑_{i=1}^{n} Ψ_ij(x_i) )

Some history

noting that the Perceptron can be written as

h(x) = sgn[wᵀ x + b] = sgn[ ∑_{i=1}^{n} w_i x_i + w_0 ]

this looks like having two Perceptron layers
layer 1: J hyper-planes

u_j(x) = sgn[ ∑_{i=1}^{n} w_ji x_i + w_j0 ],   j = 1, ..., J

layer 2:

h(x) = sgn[ ∑_{j=1}^{J} v_j u_j(x) + v_0 ]
     = sgn[ ∑_{j=1}^{J} v_j sgn( ∑_{i=1}^{n} w_ji x_i + w_j0 ) + v_0 ]

Some history

which can be written as

h(x) = sgn[g(x)]   with   g(x) = ∑_{j=1}^{J} v_j sgn( ∑_{i=1}^{n} w_ji x_i + w_j0 ) + v_0

and resembles

g(x) = ∑_{j=1}^{2n+1} Γ_j ( ∑_{i=1}^{n} Ψ_ij(x_i) )

it suggested the idea that
• while one Perceptron is not good enough
• maybe a multi-layered Perceptron (MLP) will work
a lot of work on MLPs ensued under the name of neural networks
eventually, it was shown that most functions can be approximated by MLPs
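A minimal numpy sketch of this two-layer form with sign units; the weights below are random placeholders of mine, just to show the shape of the computation:

```python
import numpy as np

def two_layer_perceptron(x, W, w0, v, v0):
    """h(x) = sgn( sum_j v_j * sgn(sum_i W[j,i] x_i + w0[j]) + v0 ).

    W: (J, n) first-layer weights, w0: (J,) first-layer biases,
    v: (J,) second-layer weights, v0: scalar second-layer bias.
    """
    u = np.sign(W @ x + w0)      # layer 1: J hyper-plane units u_j(x)
    return np.sign(v @ u + v0)   # layer 2: a Perceptron on the u_j

# example with random placeholder weights, n = 2 inputs, J = 3 hidden units
rng = np.random.default_rng(0)
W, w0 = rng.normal(size=(3, 2)), rng.normal(size=3)
v, v0 = rng.normal(size=3), rng.normal()
print(two_layer_perceptron(np.array([0.5, -1.0]), W, w0, v, v0))
```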

Graphical representation

the Perceptron is usually represented as

[figure: network diagram of the Perceptron]

• input units: the coordinates of x
• weights: the coordinates of w
• bias term w_0 handled with homogeneous coordinates: x = (x, 1)ᵀ

h(x) = sgn( ∑_i w_i x_i + w_0 ) = sgn(wᵀ x)

Sigmoids

the sgn[·] function is problematic in two ways:
• it has no derivative at 0
• it is non-smooth
approximations: it can be approximated in various ways, for example by the hyperbolic tangent

f(x) = tanh(σx) = (e^{σx} − e^{−σx}) / (e^{σx} + e^{−σx})

σ controls the approximation error, and tanh
• has a derivative everywhere
• is smooth
neural networks are implemented with these functions

[figure: plots of f(x) = tanh(σx) and its derivatives f'(x), f''(x)]
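A short numpy illustration of this approximation and its derivative; the derivative formula σ(1 − tanh²(σx)) is standard, and the value of σ is an arbitrary choice of mine:

```python
import numpy as np

def smooth_sign(x, sigma=4.0):
    """Smooth approximation of sgn(x): f(x) = tanh(sigma * x)."""
    return np.tanh(sigma * x)

def smooth_sign_derivative(x, sigma=4.0):
    """f'(x) = sigma * (1 - tanh(sigma * x)**2), defined everywhere."""
    return sigma * (1.0 - np.tanh(sigma * x) ** 2)

x = np.linspace(-2, 2, 5)
print(smooth_sign(x), smooth_sign_derivative(x))
```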

Neural network

the MLP as function approximation
