Announcements: additional (shuffled) office hours for the upcoming week; see the website. The midterm will be taken in a different room.
The Perceptron Algorithm
CSE 446: Machine Learning
Slides by Emily Fox (mostly)
Presented by Anna Karlin
April 29, 2019
Can we do binary classification without a model?
Setting: x ∈ ℝᵈ, y ∈ {-1, +1}
The perceptron algorithm
The perceptron algorithm [Rosenblatt '58, '62]
(For simplicity, in this lecture we will assume that b = 0.)
• Classification setting: y ∈ {-1, +1}, x ∈ ℝᵈ
• Linear model
  - Prediction: ŷ = sign(wᵀx + b)
• Training:
  - Initialize weight vector: w₀ = 0
  - At each time step t:
    • Observe features: x_t
    • Make prediction: ŷ_t = sign(w_tᵀ x_t + b)
    • Observe true class: y_t
    • Update model:
      - If prediction is not equal to truth (ŷ_t ≠ y_t): w_{t+1} = w_t + y_t x_t
      - Else: w_{t+1} = w_t
• Works in the online or batch setting.
[Picture due to Kilian Weinberger]
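A minimal sketch of this loop in NumPy (the function name, the number of passes, and the convention that sign(0) counts as a mistake are my choices, not from the slides):

import numpy as np

def perceptron(X, y, n_passes=10):
    # Online perceptron with b = 0 (as assumed in lecture).
    # X: (n, d) array of feature vectors; y: (n,) array of labels in {-1, +1}.
    n, d = X.shape
    w = np.zeros(d)                      # w_0 = 0
    for _ in range(n_passes):            # repeated passes = "batch" use of the online rule
        for t in range(n):
            y_hat = np.sign(w @ X[t])    # prediction: sign(w^T x_t)
            if y_hat != y[t]:            # mistake
                w = w + y[t] * X[t]      # update: w_{t+1} = w_t + y_t x_t
    return w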
How many mistakes can a perceptron make?
Assumption: data is linearly separable
Linear separability using margin γ
[Figure by Alexander Ihler]
Data is linearly separable with margin γ if there exists a vector w* with ‖w*‖ = 1 and a margin γ > 0 such that, for all i,
  y_i (w*ᵀ x_i) ≥ γ
(i.e., every point is on the correct side of the hyperplane and at distance at least γ from it).
[Figure by Kilian Weinberger]
Perceptron analysis: Linearly separable case
Theorem [Block, Novikoff]:
- Given a sequence of labeled examples (x₁, y₁), (x₂, y₂), …
- Each feature vector has bounded norm: ‖x_i‖ ≤ R for all i
- If the dataset is linearly separable with margin γ (there exists w* with ‖w*‖ = 1 and y_i (w*ᵀ x_i) ≥ γ for all i),
then the number of mistakes made by the online perceptron on this sequence is bounded by (R/γ)².
[Figure by Kilian Weinberger]
Perceptron proof for linearly separable case
• Every time we make a mistake, we get w closer to w*:
  - Mistake at time t: w_{t+1} = w_t + y_t x_t
  - Taking the dot product with w*: w_{t+1}ᵀ w* = w_tᵀ w* + y_t (x_tᵀ w*) ≥ w_tᵀ w* + γ
  - Thus, after m mistakes: w_{t+1}ᵀ w* ≥ m γ
• On the other hand, the norm of w_{t+1} doesn't grow too fast:
  - ‖w_{t+1}‖² = ‖w_t‖² + 2 y_t (w_tᵀ x_t) + ‖x_t‖² ≤ ‖w_t‖² + R²   (the middle term is ≤ 0 on a mistake)
  - Thus, after m mistakes: ‖w_{t+1}‖² ≤ m R², i.e., ‖w_{t+1}‖ ≤ √m · R
• Putting it all together:
  m γ ≤ w_{t+1}ᵀ w* ≤ ‖w_{t+1}‖ ‖w*‖ = ‖w_{t+1}‖ ≤ √m · R,
  so √m ≤ R/γ, i.e., m ≤ (R/γ)².
[Figure by Kilian Weinberger]
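Not from the slides, but a quick sanity check of this bound: generate data that is separable with some margin by a unit vector w*, run one online pass, and compare the mistake count to (R/γ)². All names and constants below are illustrative.

import numpy as np

rng = np.random.default_rng(0)
d = 5
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)              # ||w*|| = 1
X = rng.normal(size=(2000, d))
y = np.sign(X @ w_star)
keep = np.abs(X @ w_star) >= 0.1              # keep only points with margin >= 0.1
X, y = X[keep], y[keep]

R = np.linalg.norm(X, axis=1).max()           # R = max_i ||x_i||
gamma = np.abs(X @ w_star).min()              # realized margin

w = np.zeros(d)
mistakes = 0
for x_t, y_t in zip(X, y):                    # one online pass
    if np.sign(w @ x_t) != y_t:               # mistake (sign(0) counts as a mistake)
        w += y_t * x_t
        mistakes += 1

print(mistakes, "<=", (R / gamma) ** 2)       # the theorem guarantees this inequality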
Beyond linearly separable case
• Perceptron algorithm is super cool!
  - No assumption about data distribution!
    • Could be generated by an adversary; no need to be i.i.d.
  - Makes a fixed number of mistakes, and it's done forever!
    • Even if you see infinite data
• Possible to motivate the perceptron algorithm as an implementation of SGD for minimizing a certain loss function (see next homework); one candidate loss is sketched after this list.
• However, the real world is not linearly separable
  - Can't expect never to make mistakes again
  - If data are not separable, the perceptron cycles forever, and this is hard to detect
  - Even if separable, may not give good generalization accuracy (the resulting separator may have small margin)
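One standard choice of loss (an assumption on my part; the slides defer the details to the homework) is
  ℓ(w; x, y) = max(0, −y wᵀx).
On a single example, a subgradient of ℓ is −y x when y wᵀx < 0 (a mistake) and can be taken to be 0 otherwise, so an SGD step with step size 1,
  w ← w + y x   (only on mistakes),
reproduces exactly the perceptron update above.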
The kernelized perceptron
What to do if data are not linearly separable? Use feature maps… ϕ(x): ℝᵈ → F
• Start with set of inputs for each observation x = (x[1], x[2], …, x[d])
and training set: {(xi, yi)}i=1..n
• Define feature map that transforms each input vector x_i to a higher-dimensional feature vector ϕ(x_i)
  Example: (x_i[1], x_i[2], x_i[3]) ↦ ϕ(x_i) = (1, x_i[1], x_i[1]², x_i[1]x_i[2], x_i[2], x_i[2]², cos(x_i[3]/6))ᵀ
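A sketch of that particular feature map in code (illustrative only; note the slide indexes coordinates from 1, while the code below uses 0-based indexing):

import numpy as np

def phi(x):
    # x = (x[1], x[2], x[3]) on the slide corresponds to (x[0], x[1], x[2]) here.
    return np.array([
        1.0,
        x[0],                # x[1]
        x[0] ** 2,           # x[1]^2
        x[0] * x[1],         # x[1] x[2]
        x[1],                # x[2]
        x[1] ** 2,           # x[2]^2
        np.cos(x[2] / 6.0),  # cos(x[3]/6)
    ])

# The (linear) perceptron is then run on (phi(x_i), y_i) instead of (x_i, y_i).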
[Figure from "Foundations of Data Science" by Blum, Hopcroft, and Kannan]
Could even be infinite dimensional….
Example: Higher order polynomials
Feature space can get really large really quickly!
[Plot: number of monomial terms vs. number of input dimensions d, for polynomial degrees r = 2, 3, 4, 6]
- d = input dimension, r = degree of polynomial
- The number of monomial terms grows fast! For r = 6 and d = 100, there are about 1.6 billion terms.
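A quick way to reproduce that count, assuming the slide is counting monomials of degree exactly r in d variables, which is C(d + r − 1, r):

from math import comb

d, r = 100, 6
print(comb(d + r - 1, r))   # 1,609,344,100 -- roughly 1.6 billion terms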
Kernel trick
• Allows us to do these apparently intractable calculations efficiently!
Perceptron revisited
Write weight vector in terms of mistaken data points only.
Let M(t) be the set of time steps up to t when mistakes were made:
  w_t = Σ_{j ∈ M(t)} y_j x_j
Prediction rule becomes:
  ŷ_t = sign( Σ_{j ∈ M(t)} y_j (x_jᵀ x_t) )
• When using high-dimensional features ϕ:
  ŷ_t = sign( Σ_{j ∈ M(t)} y_j (ϕ(x_j)ᵀ ϕ(x_t)) )
Classification only depends on inner products!

Why does dependence on inner products help?
• Because sometimes they can be computed much much much more efficiently.
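For instance, for a degree-r polynomial feature map, (1 + xᵀz)^r equals the inner product ϕ(x)ᵀϕ(z) of a (suitably scaled) feature map containing all monomials up to degree r, and it can be evaluated in O(d) time without ever writing down the billions of features. A minimal kernelized-perceptron sketch along the lines of the prediction rule above (function names and the pass count are my own, not from the slides):

import numpy as np

def poly_kernel(x, z, r=6):
    # (1 + x^T z)^r: inner product of (suitably scaled) polynomial features of degree <= r,
    # computed in O(d) time.
    return (1.0 + x @ z) ** r

def kernel_perceptron(X, y, kernel, n_passes=5):
    # Keep only the indices M of examples where mistakes were made;
    # predict with sign( sum_{j in M} y_j K(x_j, x_t) ).
    M = []
    for _ in range(n_passes):
        for t in range(len(X)):
            score = sum(y[j] * kernel(X[j], X[t]) for j in M)
            if np.sign(score) != y[t]:    # mistake (sign(0) counts as a mistake)
                M.append(t)
    return M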