
The Perceptron

Machine Learning (Fall 2017 / Spring 2018)

1 The slides are mainly from Vivek Srikumar

1 Outline

• The Perceptron Algorithm

• Perceptron Mistake Bound

• Variants of Perceptron

2 Where are we?

• The Perceptron Algorithm

• Perceptron Mistake Bound

• Variants of Perceptron

3 Recall: Linear Classifiers

• Input is an n-dimensional vector x
• Output is a label y ∈ {-1, 1}

• Linear Threshold Units (LTUs) classify an example x using the following classification rule:

  – Output = sgn(w^T x + b) = sgn(b + Σi wi xi)

  – w^T x + b ≥ 0 → Predict y = 1
  – w^T x + b < 0 → Predict y = -1

b is called the bias term

4 Recall: Linear Classifiers

• Input is an n-dimensional vector x
• Output is a label y ∈ {-1, 1}

• Linear Threshold Units (LTUs) classify an example x using the following classification rule:

  – Output = sgn(w^T x + b) = sgn(b + Σi wi xi)

  – w^T x + b ≥ 0 → Predict y = 1
  – w^T x + b < 0 → Predict y = -1

b is called the bias term

5 The geometry of a linear classifier

sgn(b + w1 x1 + w2 x2)

We only care about the sign, not the magnitude.

[Figure: the line b + w1 x1 + w2 x2 = 0 in the (x1, x2) plane, with the weight vector [w1 w2] normal to it; positive examples (+) on one side and negative examples (-) on the other]

In n dimensions, a linear classifier represents a hyperplane that separates the space into two half-spaces.

6 The Perceptron

7 The hype

The New Yorker, December 6, 1958 P. 44

8 The New York Times, July 8, 1958; the IBM 704

The Perceptron algorithm

• Rosenblatt 1958

• The goal is to find a separating hyperplane
  – For separable data, it is guaranteed to find one

• An online algorithm
  – Processes one example at a time

• Several variants exist (will discuss briefly towards the end)

9 The Perceptron algorithm

Input: A sequence of training examples (x1, y1), (x2, y2), …, where all xi ∈ ℜ^n, yi ∈ {-1, 1}

• Initialize w0 = 0 ∈ ℜ^n
• For each training example (xi, yi):
  – Predict y’ = sgn(wt^T xi)
  – If yi ≠ y’:
    • Update wt+1 ← wt + r (yi xi)
• Return the final weight vector
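To make the loop concrete, here is a minimal sketch of this online algorithm in Python with NumPy (the function name perceptron_online and the default learning rate r=1.0 are illustrative, not part of the slides):

```python
import numpy as np

def perceptron_online(examples, r=1.0):
    """One pass of the Perceptron over a stream of (x, y) pairs with y in {-1, +1}."""
    n = len(examples[0][0])
    w = np.zeros(n)                      # initialize w0 = 0
    for x, y in examples:
        if y * (w @ x) <= 0:             # mistake: the prediction disagrees with y
            w = w + r * y * x            # update: wt+1 <- wt + r * yi * xi
    return w                             # final weight vector
```

With w initialized to zero, the first example always triggers an update, which matches treating yi wt^T xi ≤ 0 as a mistake (as the later slides do).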

10 The Perceptron algorithm

Input: A sequence of training examples (x1, y1), (x2, y2), …, where all xi ∈ ℜ^n, yi ∈ {-1, 1}

Remember: Prediction = sgn(w^T x). There is typically a bias term as well (w^T x + b), but the bias may be treated as a constant feature and folded into w.

• Initialize w0 = 0 ∈ ℜ^n
• For each training example (xi, yi):
  – Predict y’ = sgn(wt^T xi)
  – If yi ≠ y’:
    • Update wt+1 ← wt + r (yi xi)
• Return the final weight vector

11 The Perceptron algorithm

Input: A sequence of training examples (x1, y1), (x2, y2), …, where all xi ∈ ℜ^n, yi ∈ {-1, 1}

Remember: Prediction = sgn(w^T x). There is typically a bias term as well (w^T x + b), but the bias may be treated as a constant feature and folded into w.

• Initialize w0 = 0 ∈ ℜ^n
• For each training example (xi, yi):
  – Predict y’ = sgn(wt^T xi)
  – If yi ≠ y’:
    • Update wt+1 ← wt + r (yi xi)
• Return the final weight vector

Footnote: For some problems it is mathematically easier to represent False as -1, and at other times as 0. For the Perceptron algorithm, treat -1 as false and +1 as true.

12 The Perceptron algorithm

Input: A sequence of training examples (x1, y1), (x2, y2), …, where all xi ∈ ℜ^n, yi ∈ {-1, 1}

Mistake on positive: wt+1 ← wt + r xi
Mistake on negative: wt+1 ← wt - r xi

• Initialize w0 = 0 ∈ ℜ^n
• For each training example (xi, yi):
  – Predict y’ = sgn(wt^T xi)
  – If yi ≠ y’:
    • Update wt+1 ← wt + r (yi xi)
• Return the final weight vector

13 The Perceptron algorithm

Input: A sequence of training examples (x1, y1), (x2, y2), …, where all xi ∈ ℜ^n, yi ∈ {-1, 1}

Mistake on positive: wt+1 ← wt + r xi
Mistake on negative: wt+1 ← wt - r xi

r is the learning rate, a small positive number less than 1

• Initialize w0 = 0 ∈ ℜ^n
• For each training example (xi, yi):
  – Predict y’ = sgn(wt^T xi)
  – If yi ≠ y’:
    • Update wt+1 ← wt + r (yi xi)
• Return the final weight vector

14 The Perceptron algorithm

Input: A sequence of training examples (x1, y1), (x2, y2), …, where all xi ∈ ℜ^n, yi ∈ {-1, 1}

Mistake on positive: wt+1 ← wt + r xi
Mistake on negative: wt+1 ← wt - r xi

r is the learning rate, a small positive number less than 1

• Initialize w0 = 0 ∈ ℜ^n
• For each training example (xi, yi):
  – Predict y’ = sgn(wt^T xi)
  – If yi ≠ y’:
    • Update wt+1 ← wt + r (yi xi)
• Return the final weight vector

Update only on error: a mistake-driven algorithm

15 The Perceptron algorithm

Input: A sequence of training examples (x1, y1), (x2, y2), …, where all xi ∈ ℜ^n, yi ∈ {-1, 1}

Mistake on positive: wt+1 ← wt + r xi
Mistake on negative: wt+1 ← wt - r xi

r is the learning rate, a small positive number less than 1

• Initialize w0 = 0 ∈ ℜ^n
• For each training example (xi, yi):
  – Predict y’ = sgn(wt^T xi)
  – If yi ≠ y’:
    • Update wt+1 ← wt + r (yi xi)
• Return the final weight vector

Update only on error: a mistake-driven algorithm.
This is the simplest version; we will see more robust versions at the end.

16 The Perceptron algorithm

Input: A sequence of training examples (x1, y1), (x2, y2), …, where all xi ∈ ℜ^n, yi ∈ {-1, 1}

Mistake on positive: wt+1 ← wt + r xi
Mistake on negative: wt+1 ← wt - r xi

r is the learning rate, a small positive number less than 1

• Initialize w0 = 0 ∈ ℜ^n
• For each training example (xi, yi):
  – Predict y’ = sgn(wt^T xi)
  – If yi ≠ y’:
    • Update wt+1 ← wt + r (yi xi)
• Return the final weight vector

Update only on error: a mistake-driven algorithm.
This is the simplest version; we will see more robust versions at the end.

A mistake can be written as yi wt^T xi ≤ 0

17 Intuition behind the update

Mistake on positive: wt+1 ← wt + r xi
Mistake on negative: wt+1 ← wt - r xi

Suppose we have made a mistake on a positive example; that is, y = +1 and wt^T x < 0.

Call the new weight vector wt+1 = wt + x (say r = 1)

The new dot product will be wt+1^T x = (wt + x)^T x = wt^T x + x^T x > wt^T x

For a positive example, the Perceptron update will increase the score assigned to the same input

Similar reasoning for negative examples
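A tiny numeric check of this claim (the vectors below are made up purely for illustration):

```python
import numpy as np

# A positive example that the current weights misclassify: y = +1 but w^T x < 0
w_old = np.array([1.0, -2.0])
x = np.array([0.5, 1.0])
print(w_old @ x)            # -1.5  -> scored negative: a mistake

w_new = w_old + x           # update with r = 1 for a mistake on a positive example
print(w_new @ x)            # -0.25 -> the score increased by x^T x = 1.25
```

The score need not become positive after a single update; it only moves in the right direction.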

18 Geometry of the perceptron update

Mistake on positive: wt+1 ← wt + r xi
Mistake on negative: wt+1 ← wt - r xi

For a mistake on a positive example:

[Figure sequence: Predict: the current weight vector wold puts the positive example (x, +1) on the wrong side of its decision boundary. Update: w ← w + x, i.e., x is added to wold. After: the new weight vector wnew has rotated toward (x, +1), so the example is scored higher.]

25 Geometry of the perceptron update

For a mistake on a negative example:

[Figure sequence: Predict: the current weight vector wold puts the negative example (x, -1) on the wrong side of its decision boundary. Update: w ← w - x, i.e., x is subtracted from wold. After: the new weight vector wnew has rotated away from (x, -1), so the example is scored lower.]

31 Perceptron Learnability

• Obviously, the Perceptron cannot learn what it cannot represent
  – Only linearly separable functions

• Minsky and Papert (1969) wrote an influential book demonstrating Perceptron’s representational limitations

– Parity functions can’t be learned (XOR)
  • But we already know that XOR is not linearly separable
– Feature transformation trick (see the sketch below)
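As a concrete illustration of the feature transformation trick (a sketch, not from the slides; the added product feature and the particular weights are one choice among many):

```python
import numpy as np

# XOR with labels in {-1, +1}: not linearly separable in the original features
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, +1, +1, -1])

# Feature transformation: append the product feature x1 * x2
phi = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])

# In the transformed space a linear threshold unit separates XOR,
# e.g. w = (1, 1, -2) with bias b = -0.5
w, b = np.array([1.0, 1.0, -2.0]), -0.5
print(np.sign(phi @ w + b))   # [-1.  1.  1. -1.]  == y
```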

32 What you need to know

• The Perceptron algorithm

• The geometry of the update

• What can it represent

33 Where are we?

• The Perceptron Algorithm

• Perceptron Mistake Bound

• Variants of Perceptron

34 Convergence

Convergence theorem
– If there exists a set of weights consistent with the data (i.e., the data is linearly separable), the perceptron algorithm will converge.

35 Convergence

Convergence theorem
– If there exists a set of weights consistent with the data (i.e., the data is linearly separable), the perceptron algorithm will converge.

Cycling theorem – If the training data is not linearly separable, then the learning algorithm will eventually repeat the same set of weights and enter an infinite loop

36 Margin

The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.

[Figure: positive (+) and negative (-) points separated by a hyperplane; the distance from the hyperplane to the nearest point is marked as the margin with respect to this hyperplane]

37 Margin

• The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.

• The margin of a data set (γ) is the maximum margin possible for that dataset using any weight vector.

[Figure: the same data with the hyperplane that attains the maximum possible margin; this maximum is the margin of the data]
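A short sketch of how one might compute the margin of a dataset with respect to a given hyperplane (the function name is illustrative; finding the maximum over all weight vectors, i.e. the margin of the data, is a harder optimization and is not shown):

```python
import numpy as np

def margin_wrt_hyperplane(X, y, w, b=0.0):
    """Margin of a labeled dataset with respect to the hyperplane w^T x + b = 0.

    Returns min_i y_i (w^T x_i + b) / ||w||; it is positive only if the
    hyperplane actually separates the data.
    """
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)
```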

38 Mistake Bound Theorem [Novikoff 1962, Block 1962]

Let {(x1, y1), (x2, y2), …, (xm, ym)} be a sequence of training examples such that for all i, the feature vector xi ∈ ℜ^n, ||xi|| ≤ R, and the label yi ∈ {-1, +1}.

Suppose there exists a unit vector u ∈ ℜ^n (i.e., ||u|| = 1) such that for some γ ∈ ℜ and γ > 0 we have yi (u^T xi) ≥ γ.

Then, the perceptron algorithm will make at most (R/γ)^2 mistakes on the training sequence.

39 Mistake Bound Theorem [Novikoff 1962, Block 1962]

Let {(x1, y1), (x2, y2), …, (xm, ym)} be a sequence of training examples such that for all i, the feature vector xi ∈ ℜ^n, ||xi|| ≤ R, and the label yi ∈ {-1, +1}.

Suppose there exists a unit vector u ∈ ℜ^n (i.e., ||u|| = 1) such that for some γ ∈ ℜ and γ > 0 we have yi (u^T xi) ≥ γ.

Then, the perceptron algorithm will make at most (R/γ)^2 mistakes on the training sequence.

40 Mistake Bound Theorem [Novikoff 1962, Block 1962]

Let {(x1, y1), (x2, y2), …, (xm, ym)} be a sequence of training examples such that for all i, the feature vector xi ∈ ℜ^n, ||xi|| ≤ R, and the label yi ∈ {-1, +1}.

Suppose there exists a unit vector u ∈ ℜ^n (i.e., ||u|| = 1) such that for some γ ∈ ℜ and γ > 0 we have yi (u^T xi) ≥ γ.

Then, the perceptron algorithm will make at most (R/γ)^2 mistakes on the training sequence.

41 Mistake Bound Theorem [Novikoff 1962, Block 1962]

Let {(x1, y1), (x2, y2), …, (xm, ym)} be a sequence of training examples such that for all i, the feature vector xi ∈ ℜ^n, ||xi|| ≤ R, and the label yi ∈ {-1, +1}.

We can always find such an R: just look for the farthest data point from the origin.

Suppose there exists a unit vector u ∈ ℜ^n (i.e., ||u|| = 1) such that for some γ ∈ ℜ and γ > 0 we have yi (u^T xi) ≥ γ.

Then, the perceptron algorithm will make at most (R/γ)^2 mistakes on the training sequence.

42 Mistake Bound Theorem [Novikoff 1962, Block 1962]

Let {(x1, y1), (x2, y2), …, (xm, ym)} be a sequence of training examples such that for all i, the feature vector xi ∈ ℜ^n, ||xi|| ≤ R, and the label yi ∈ {-1, +1}.

Suppose there exists a unit vector u ∈ ℜ^n (i.e., ||u|| = 1) such that for some γ ∈ ℜ and γ > 0 we have yi (u^T xi) ≥ γ.

Then, the perceptron algorithm will make at most (R/γ)^2 mistakes on the training sequence.

The data and u have a margin γ. Importantly, the data is separable; γ is the parameter that defines the separability of the data.

43 Mistake Bound Theorem [Novikoff 1962, Block 1962]

Let {(x1, y1), (x2, y2), …, (xm, ym)} be a sequence of training examples such that for all i, the feature vector xi ∈ ℜ^n, ||xi|| ≤ R, and the label yi ∈ {-1, +1}.

Suppose there exists a unit vector u ∈ ℜ^n (i.e., ||u|| = 1) such that for some γ ∈ ℜ and γ > 0 we have yi (u^T xi) ≥ γ.

Then, the perceptron algorithm will make at most (R/γ)^2 mistakes on the training sequence.

If u hadn’t been a unit vector, then we could scale γ in the mistake bound. This would change the final mistake bound to (||u|| R/γ)^2.

44 Mistake Bound Theorem [Novikoff 1962, Block 1962]

Let {(x1, y1), (x2, y2), …, (xm, ym)} be a sequence of training examples such that for all i, the feature vector xi ∈ ℜ^n, ||xi|| ≤ R, and the label yi ∈ {-1, +1}.

Suppose we have a dataset with n-dimensional inputs.

Suppose there exists a unit vector u ∈ ℜ^n (i.e., ||u|| = 1) such that for some γ ∈ ℜ and γ > 0 we have yi (u^T xi) ≥ γ.

If the data is separable,…

Then, the perceptron algorithm will make at most (R/γ)^2 mistakes on the training sequence.

…then the Perceptron algorithm will find a separating hyperplane after making a finite number of mistakes.

45 Proof (preliminaries)

• Receive an input (xi, yi)
• If sgn(wt^T xi) ≠ yi: Update wt+1 ← wt + yi xi

The setting

• Initial weight vector w is all zeros

• Learning rate = 1
  – Effectively scales inputs, but does not change the behavior

• All training examples are contained in a ball of size R
  – ||xi|| ≤ R

• The training data is separable by margin γ using a unit vector u
  – yi (u^T xi) ≥ γ

46 Proof (1/3)

• Receive an input (xi, yi)
• If sgn(wt^T xi) ≠ yi: Update wt+1 ← wt + yi xi

1. Claim: After t mistakes, u^T wt ≥ tγ

47 Proof (1/3)

• Receive an input (xi, yi)
• If sgn(wt^T xi) ≠ yi: Update wt+1 ← wt + yi xi

1. Claim: After t mistakes, u^T wt ≥ tγ

Because the data is separable by a margin γ

48 Proof (1/3)

• Receive an input (xi, yi)
• If sgn(wt^T xi) ≠ yi: Update wt+1 ← wt + yi xi

1. Claim: After t mistakes, u^T wt ≥ tγ

Because the data is separable by a margin γ

Because w0 = 0 (i.e., u^T w0 = 0), straightforward induction gives us u^T wt ≥ tγ
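The one-step inequality behind this induction (implicit in the slide's figure; written out here as a sketch):

```latex
% Effect of one mistake update w_{t+1} = w_t + y_i x_i on u^T w:
\[
u^\top w_{t+1}
  = u^\top (w_t + y_i x_i)
  = u^\top w_t + y_i\,(u^\top x_i)
  \;\ge\; u^\top w_t + \gamma
\]
% since separability with margin gamma gives y_i (u^T x_i) >= gamma.
% Starting from u^T w_0 = 0, after t mistakes this yields u^T w_t >= t*gamma.
```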

49 Proof (2/3)

• Receive an input (xi, yi)
• If sgn(wt^T xi) ≠ yi: Update wt+1 ← wt + yi xi

2. Claim: After t mistakes, ||wt||^2 ≤ tR^2

50 Proof (2/3)

• Receive an input (xi, yi)
• If sgn(wt^T xi) ≠ yi: Update wt+1 ← wt + yi xi

2. Claim: After t mistakes, ||wt||^2 ≤ tR^2

||xi|| ≤ R, by definition of R. The weight is updated only when there is a mistake, that is, when yi wt^T xi < 0.

51 Proof (2/3)

• Receive an input (xi, yi)
• If sgn(wt^T xi) ≠ yi: Update wt+1 ← wt + yi xi

2. Claim: After t mistakes, ||wt||^2 ≤ tR^2

Because w0 = 0 (i.e., ||w0||^2 = 0), straightforward induction gives us ||wt||^2 ≤ tR^2
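The corresponding one-step bound for the squared norm (again a sketch of the standard argument):

```latex
% Effect of one mistake update w_{t+1} = w_t + y_i x_i on ||w||^2:
\[
\|w_{t+1}\|^2
  = \|w_t + y_i x_i\|^2
  = \|w_t\|^2 + 2\,y_i\,(w_t^\top x_i) + \|x_i\|^2
  \;\le\; \|w_t\|^2 + R^2
\]
% because the update happens only on a mistake (y_i w_t^T x_i < 0) and ||x_i|| <= R.
% Starting from ||w_0||^2 = 0, after t mistakes this yields ||w_t||^2 <= t R^2.
```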

52 Proof (3/3)

What we know:
1. After t mistakes, u^T wt ≥ tγ
2. After t mistakes, ||wt||^2 ≤ tR^2

53 Proof (3/3)

What we know:
1. After t mistakes, u^T wt ≥ tγ
2. After t mistakes, ||wt||^2 ≤ tR^2

From (2)

54 Proof (3/3)

What we know:
1. After t mistakes, u^T wt ≥ tγ
2. After t mistakes, ||wt||^2 ≤ tR^2

From (2)

u^T wt = ||u|| ||wt|| cos(angle between u and wt)

But ||u|| = 1 and the cosine is at most 1

So u^T wt ≤ ||wt||

55 Proof (3/3)

What we know:
1. After t mistakes, u^T wt ≥ tγ
2. After t mistakes, ||wt||^2 ≤ tR^2

From (2)

u^T wt = ||u|| ||wt|| cos(angle between u and wt)

But ||u|| = 1 and the cosine is at most 1

So u^T wt ≤ ||wt|| (Cauchy-Schwarz inequality)

56 Proof (3/3)

What we know:
1. After t mistakes, u^T wt ≥ tγ
2. After t mistakes, ||wt||^2 ≤ tR^2

From (2) From (1)

u^T wt = ||u|| ||wt|| cos(angle between u and wt)

But ||u|| = 1 and the cosine is at most 1

So u^T wt ≤ ||wt||

57 Proof (3/3)

What we know:
1. After t mistakes, u^T wt ≥ tγ
2. After t mistakes, ||wt||^2 ≤ tR^2

Number of mistakes

58 Proof (3/3)

What we know:
1. After t mistakes, u^T wt ≥ tγ
2. After t mistakes, ||wt||^2 ≤ tR^2

Number of mistakes: combining (1) and (2), tγ ≤ u^T wt ≤ ||wt|| ≤ √t R, which gives t ≤ (R/γ)^2.

Bounds the total number of mistakes!
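A quick empirical sanity check of the bound on synthetic separable data (a sketch; the separating direction u is fixed by construction, and R and γ are computed from the generated sample):

```python
import numpy as np

rng = np.random.default_rng(0)
u = np.array([1.0, 1.0]) / np.sqrt(2)       # unit vector defining the true separator

# Sample 2-d points and keep only those with margin at least 0.2 along u
X = rng.uniform(-1, 1, size=(500, 2))
X = X[np.abs(X @ u) >= 0.2]
y = np.sign(X @ u)

R = np.max(np.linalg.norm(X, axis=1))       # radius of the data
gamma = np.min(y * (X @ u))                 # margin of the data along u

# Run the online Perceptron and count its mistakes
w, mistakes = np.zeros(2), 0
for x_i, y_i in zip(X, y):
    if y_i * (w @ x_i) <= 0:
        w = w + y_i * x_i
        mistakes += 1

print(mistakes, (R / gamma) ** 2)           # mistakes should never exceed (R/gamma)^2
```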

59 Mistake Bound Theorem [Novikoff 1962, Block 1962]

Let {(x1, y1), (x2, y2), …, (xm, ym)} be a sequence of training examples such that for all i, the feature vector xi ∈ ℜ^n, ||xi|| ≤ R, and the label yi ∈ {-1, +1}.

Suppose there exists a unit vector u ∈ ℜ^n (i.e., ||u|| = 1) such that for some γ ∈ ℜ and γ > 0 we have yi (u^T xi) ≥ γ.

Then, the perceptron algorithm will make at most (R/γ)^2 mistakes on the training sequence.

60 Mistake Bound Theorem [Novikoff 1962, Block 1962]

[The following slides repeat the theorem statement, with the key result highlighted: Number of mistakes ≤ (R/γ)^2]

64 The Perceptron Mistake bound

Number of mistakes ≤ (R/γ)^2

• Exercises:
  – How many mistakes will the Perceptron algorithm make for disjunctions with n attributes?
    • What are R and γ?
  – How many mistakes will the Perceptron algorithm make for k-disjunctions with n attributes?

65 What you need to know

• What is the perceptron mistake bound?

• How to prove it

66 Where are we?

• The Perceptron Algorithm

• Perceptron Mistake Bound

• Variants of Perceptron

67 Practical use of the Perceptron algorithm

1. Using the Perceptron algorithm with a finite dataset

2. Margin Perceptron

3. Voting and Averaging

68 1. The “standard” algorithm

Given a training set D = {(xi, yi)}, xi ∈ ℜ^n, yi ∈ {-1, 1}
1. Initialize w = 0 ∈ ℜ^n
2. For epoch = 1 … T:
   1. Shuffle the data
   2. For each training example (xi, yi) ∈ D:
      • If yi w^T xi ≤ 0, update w ← w + r yi xi
3. Return w

Prediction: sgn(w^T x)
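A runnable sketch of this "standard" variant (NumPy; the function names and the defaults T=10, r=1.0 are illustrative choices, not fixed by the slides):

```python
import numpy as np

def perceptron_train(X, y, T=10, r=1.0, seed=0):
    """Multi-epoch Perceptron: shuffle every epoch, update on mistakes, return w."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])                 # 1. initialize w = 0
    for _ in range(T):                       # 2. for each epoch
        for i in rng.permutation(len(y)):    #    shuffle the data
            if y[i] * (w @ X[i]) <= 0:       #    error?
                w = w + r * y[i] * X[i]      #    update
    return w                                 # 3. return w

def perceptron_predict(w, X):
    return np.where(X @ w >= 0, 1, -1)       # prediction: sgn(w^T x)
```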

69 1. The “standard” algorithm

Given a training set D = {(xi, yi)}, xi ∈ ℜ^n, yi ∈ {-1, 1}
1. Initialize w = 0 ∈ ℜ^n
2. For epoch = 1 … T:
   1. Shuffle the data
   2. For each training example (xi, yi) ∈ D:
      • If yi w^T xi ≤ 0, update w ← w + r yi xi
3. Return w

T is a hyper-parameter to the algorithm.
"yi w^T xi ≤ 0" is another way of writing that there is an error.

Prediction: sgn(w^T x)

70 2. Margin Perceptron

Which hyperplane is better?

h1

h2

71 2. Margin Perceptron

Which hyperplane is better?

h1

h2

72 2. Margin Perceptron

Which hyperplane is better?

h1

h2

The farther the hyperplane is from the data points, the less chance of making a wrong prediction.

73 2. Margin Perceptron

• The Perceptron makes updates only when the prediction is incorrect:
  yi w^T xi < 0

• What if the prediction is close to being incorrect? That is, pick a positive η and update when
  yi w^T xi / ||w|| < η

• Can generalize better, but need to choose η
  – Why is this a good idea?

74 2. Margin Perceptron

• The Perceptron makes updates only when the prediction is incorrect:
  yi w^T xi < 0

• What if the prediction is close to being incorrect? That is, pick a positive η and update when
  yi w^T xi / ||w|| < η

• Can generalize better, but need to choose η
  – Why is this a good idea? Intentionally set a large margin
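A sketch of the margin-Perceptron update (the threshold eta plays the role of η above; guarding against the all-zeros initial w is an implementation detail, not something the slides specify):

```python
import numpy as np

def margin_perceptron_train(X, y, T=10, r=1.0, eta=0.1, seed=0):
    """Update whenever y_i w^T x_i / ||w|| < eta, not only on outright mistakes."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(T):
        for i in rng.permutation(len(y)):
            norm = np.linalg.norm(w)
            # While w is still the zero vector, every example triggers an update
            score = y[i] * (w @ X[i]) / norm if norm > 0 else 0.0
            if score < eta:
                w = w + r * y[i] * X[i]
    return w
```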

75 3. Voting and Averaging

What if the data is not linearly separable?

Finding a hyperplane with the minimum number of mistakes is NP-hard

Voted Perceptron

Given a training set D = {(xi, yi)}, xi ∈ ℜ^n, yi ∈ {-1, 1}
1. Initialize w = 0 ∈ ℜ^n and a = 0 ∈ ℜ^n
2. For epoch = 1 … T:
   – For each training example (xi, yi) ∈ D:
     • If yi w^T xi ≤ 0
       – update wm+1 ← wm + r yi xi
       – m = m + 1
       – Cm = 1
     • Else
       – Cm = Cm + 1
3. Return (w1, C1), (w2, C2), …, (wk, Ck)

Prediction: sgn(Σ_{i=1}^{k} Ci sgn(wi^T x))
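A sketch of the voted variant (illustrative names; each weight vector is stored together with the count of examples it classified correctly, and prediction is a weighted vote of their sign predictions):

```python
import numpy as np

def voted_perceptron_train(X, y, T=10, r=1.0):
    """Voted Perceptron: keep every weight vector w_m with its survival count C_m."""
    weights = [np.zeros(X.shape[1])]
    counts = [0]
    for _ in range(T):
        for x_i, y_i in zip(X, y):
            if y_i * (weights[-1] @ x_i) <= 0:
                weights.append(weights[-1] + r * y_i * x_i)   # new weight vector
                counts.append(1)                              # C_m = 1
            else:
                counts[-1] += 1                               # C_m = C_m + 1
    return weights, counts

def voted_predict(weights, counts, x):
    votes = sum(c * np.sign(w @ x) for w, c in zip(weights, counts))
    return 1 if votes >= 0 else -1
```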

Averaged Perceptron

Given a training set D = {(xi, yi)}, xi ∈ ℜ^n, yi ∈ {-1, 1}
1. Initialize w = 0 ∈ ℜ^n and a = 0 ∈ ℜ^n
2. For epoch = 1 … T:
   – For each training example (xi, yi) ∈ D:
     • If yi w^T xi ≤ 0
       – update w ← w + r yi xi
     • a ← a + w
3. Return a

Prediction: sgn(a^T x)

Averaged Perceptron

Given a training set D = {(xi, yi)}, xi ∈ ℜ^n, yi ∈ {-1, 1}
1. Initialize w = 0 ∈ ℜ^n and a = 0 ∈ ℜ^n
2. For epoch = 1 … T:
   – For each training example (xi, yi) ∈ D:
     • If yi w^T xi ≤ 0
       – update w ← w + r yi xi
     • a ← a + w
3. Return a

Prediction: sgn(a^T x)

This is the simplest version of the averaged perceptron. There are some easy programming tricks to make sure that a is also updated only when there is an error.

Averaged Perceptron

Given a training set D = {(xi, yi)}, xi ∈ ℜ^n, yi ∈ {-1, 1}
1. Initialize w = 0 ∈ ℜ^n and a = 0 ∈ ℜ^n
2. For epoch = 1 … T:
   – For each training example (xi, yi) ∈ D:
     • If yi w^T xi ≤ 0
       – update w ← w + r yi xi
     • a ← a + w
3. Return a

Prediction: sgn(a^T x)

This is the simplest version of the averaged perceptron. There are some easy programming tricks to make sure that a is also updated only when there is an error.

Extremely popular: if you want to use the Perceptron algorithm, use averaging.
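A sketch of the simplest averaged variant exactly as written above (the more memory-friendly bookkeeping tricks the slide mentions are not shown):

```python
import numpy as np

def averaged_perceptron_train(X, y, T=10, r=1.0):
    """Averaged Perceptron: accumulate w after every example and return the sum a."""
    w = np.zeros(X.shape[1])
    a = np.zeros(X.shape[1])
    for _ in range(T):
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:      # mistake-driven update of w
                w = w + r * y_i * x_i
            a = a + w                     # a accumulates w for every example
    return a                              # prediction: sgn(a^T x)
```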

Question: What is the difference?

Voted prediction: sgn(Σ_{i=1}^{k} Ci sgn(wi^T x))
Averaged prediction: sgn(Σ_{i=1}^{k} Ci wi^T x)

Example: take three weight vectors with C1 = C2 = C3 = 1 and write w1^T x = s1, w2^T x = s2, w3^T x = s3.
– Averaged predicts +1 when s1 + s2 + s3 ≥ 0
– Voted predicts +1 when any two of s1, s2, s3 are positive

81 Summary: Perceptron

• Online learning algorithm, very widely used, easy to implement

• Additive updates to weights

• Geometric interpretation

• Mistake bound

• Practical variants abound

• You should be able to implement the Perceptron algorithm

82