
Additional (shuffled) office hours for the upcoming week: see the website.

Note: you will take the midterm in a different room (see the announcement for your assignment).

The Perceptron Algorithm. CSE 446: Slides by Emily Fox (mostly), presented by Anna Karlin

April 29, 2019

©2017 Emily Fox CSE 446: Machine Learning
Can we do binary classification without a model? (Setting: x ∈ R^d, y ∈ {-1, +1}.)

2 CSE 446: Machine Learning The perceptron algorithm

©2017 Emily Fox CSE 446: Machine Learning
The perceptron algorithm [Rosenblatt '58, '62]
(For simplicity, in this lecture we will assume that b = 0.)

• Classification setting: y ∈ {-1, +1}, x ∈ R^d
• Linear model (find w, b)
  - Prediction: ŷ = sign(w^T x + b)
• Training (online or batch setting):
  - Initialize weight vector: w_0 = 0
  - At each time step t:
    • Observe features: x_t
    • Make prediction: ŷ_t = sign(w_t^T x_t + b)
    • Observe true class: y_t
    • Update model: if the prediction is not equal to the truth (ŷ_t ≠ y_t), set w_{t+1} = w_t + y_t x_t; else w_{t+1} = w_t
(Picture due to Kilian Weinberger) 4 ©2017 Emily Fox CSE 446: Machine Learning
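As a concrete illustration of the update rule above, here is a minimal NumPy sketch of the perceptron with b = 0, as assumed in this lecture (function and variable names are my own, not from the slides):

```python
import numpy as np

def perceptron(X, y, n_passes=10):
    """Online perceptron (b = 0): update w only when a mistake is made."""
    n, d = X.shape
    w = np.zeros(d)                      # w_0 = 0
    for _ in range(n_passes):            # batch setting: sweep the data repeatedly
        for t in range(n):
            y_hat = np.sign(w @ X[t])    # predict sign(w_t^T x_t); sign(0) is treated as a mistake
            if y_hat != y[t]:            # mistake: y_hat != y_t
                w = w + y[t] * X[t]      # w_{t+1} = w_t + y_t x_t
    return w
```

On linearly separable data this loop eventually completes a full sweep without updating, and w then never changes again.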


5 CSE 446: Machine Learning How many mistakes can a perceptron make?

©2017 Emily Fox CSE 446: Machine Learning Assumption: the data is linearly separable

7 Figure by Alexander Ihler CSE 446: Machine Learning Linear separability using margin γ


Data is linearly separable with margin γ if there exists a vector w* with ||w*|| = 1 such that the distance from every point x_i to the hyperplane is at least γ, i.e., y_i (w*^T x_i) ≥ γ for all i.
Figure by Kilian Weinberger
8 ©2017 Emily Fox CSE 446: Machine Learning

Perceptron analysis: Linearly separable case

Theorem [Block, Novikoff]:
- Given a sequence of labeled examples (x_1, y_1), (x_2, y_2), …
- Each feature vector x_i ∈ R^d has bounded norm: ||x_i|| ≤ R
- If the dataset is linearly separable with margin γ (there exists w* with ||w*|| = 1 and y_i (w*^T x_i) ≥ γ for all i),

then the number of mistakes made by the online perceptron on this sequence is bounded by R²/γ².
(Figure by Kilian Weinberger)

9 ©2017 Emily Fox CSE 446: Machine Learning
(Recall: ||x_i|| ≤ R and ||w*|| = 1 with y_i (w*^T x_i) ≥ γ for all i.)
Perceptron proof for linearly separable case

• Every time we make a mistake, we get w closer to w* (the dot product w_t^T w* grows):
  - Mistake at time t: w_{t+1} = w_t + y_t x_t
  - Taking the dot product with w*: w_{t+1}^T w* = w_t^T w* + y_t (w*^T x_t) ≥ w_t^T w* + γ
  - Thus, after m mistakes: w_{t+1}^T w* ≥ m γ
• On the other hand, the norm of w_{t+1} doesn't grow too fast:
  - ||w_{t+1}||² = ||w_t||² + 2 y_t (w_t^T x_t) + ||x_t||² ≤ ||w_t||² + R²   (the middle term is ≤ 0, since we only update on a mistake)
  - Thus, after m mistakes: ||w_{t+1}||² ≤ m R²
• Putting it all together:
  m γ ≤ w_{t+1}^T w* ≤ ||w_{t+1}|| ||w*|| = ||w_{t+1}|| ≤ √m R, so m ≤ R²/γ².
Figure by Kilian Weinberger
10 ©2017 Emily Fox CSE 446: Machine Learning
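A quick empirical sanity check of the bound on synthetic separable data; this sketch is mine, not from the slides (the separator, the margin threshold 0.1, and the number of passes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
w_star = np.array([1.0, -1.0]) / np.sqrt(2)            # ||w*|| = 1
X = rng.uniform(-1, 1, size=(500, 2))
scores = X @ w_star
keep = np.abs(scores) > 0.1                            # enforce a margin of at least 0.1
X, y = X[keep], np.sign(scores[keep])

R = np.max(np.linalg.norm(X, axis=1))                  # bound on ||x_i||
gamma = np.min(y * (X @ w_star))                       # achieved margin

w, mistakes = np.zeros(2), 0
for _ in range(50):                                    # sweep until the perceptron stops updating
    for xt, yt in zip(X, y):
        if np.sign(w @ xt) != yt:
            w, mistakes = w + yt * xt, mistakes + 1

print(mistakes, "<=", R**2 / gamma**2)                 # total mistakes vs. R^2 / gamma^2
```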

Beyond linearly separable case

• Perceptron algorithm is super cool!
  - No assumption about data distribution!
    • Could be generated by an adversary, no need to be iid
  - Makes a fixed number of mistakes, and it's done forever!
    • Even if you see infinite data

11 ©2017 Emily Fox CSE 446: Machine Learning

12 ©2017 Emily Fox CSE 446: Machine Learning Beyond linearly separable case

• Perceptron algorithm is super cool!
  - No assumption about data distribution!
    • Could be generated by an adversary, no need to be iid
  - Makes a fixed number of mistakes, and it's done forever!
    • Even if you see infinite data

• Possible to motivate the perceptron algorithm as an implementation of SGD for minimizing a certain loss function. (see next homework)

13 ©2017 Emily Fox CSE 446: Machine Learning Beyond linearly separable case

• Perceptron algorithm is super cool!
  - No assumption about data distribution!
    • Could be generated by an adversary, no need to be iid
  - Makes a fixed number of mistakes, and it's done forever!
    • Even if you see infinite data

• However, the real world is not linearly separable
  - Can't expect to never make mistakes again
  - If the data is not separable, the perceptron cycles forever, and this is hard to detect
  - Even if separable, it may not give good generalization accuracy (may have a small margin)

14 ©2017 Emily Fox CSE 446: Machine Learning The kernelized perceptron

©2017 Emily Fox CSE 446: Machine Learning
What to do if data are not linearly separable? Use feature maps… ϕ(x): R^d → F
• Start with the set of inputs for each observation x = (x[1], x[2], …, x[d])

and training set: {(xi, yi)}i=1..n

• Define a feature map that transforms each input vector xi into a higher-dimensional feature vector ϕ(xi). Example (a code sketch follows): for xi = (xi[1], xi[2], xi[3])^T,
  ϕ(xi) = (1, xi[1], xi[1]², xi[1]xi[2], xi[2], xi[2]², cos(xi[3]/6))^T
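A small sketch of that feature map in code (assuming the cosine term is cos(xi[3]/6), exactly as printed above):

```python
import numpy as np

def phi(x):
    """Map a 3-dimensional input to the 7-dimensional feature vector from the slide."""
    x1, x2, x3 = x
    return np.array([1.0, x1, x1**2, x1 * x2, x2, x2**2, np.cos(x3 / 6)])

print(phi(np.array([2.0, -1.0, 3.0])))
```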

16 ©2017 Emily Fox CSE 446: Machine Learning


From ”Foundations of Data Science” by Blum, Kannan, Hopcroft
17 CSE 446: Machine Learning
What to do if data are not linearly separable? Use feature maps… ϕ(x): R^d → F

Could even be infinite dimensional….

18 ©2017 Emily Fox CSE 446: Machine Learning Example: Higher order polynomials

Feature space can get really large really quickly!

[Plot: number of monomial terms vs. number of input dimensions, for polynomial degrees r = 2, 3, 4. d = input dimension, r = degree of polynomial. The number of terms grows fast: for r = 6 and d = 100, about 1.6 billion terms.]
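The counts in the plot can be reproduced with a one-line formula; this sketch assumes the slide counts monomials of degree exactly r in d variables, i.e., C(d + r − 1, r):

```python
from math import comb

def n_monomials(d, r):
    """Number of monomials of degree exactly r in d variables."""
    return comb(d + r - 1, r)

print(n_monomials(100, 6))   # about 1.6 billion, matching the slide
```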

19 ©2017 Emily Fox CSE 446: Machine Learning Kernel trick

• Allows us to do these apparently intractable calculations efficiently!

20 CSE 446: Machine Learning Perceptron revisited: write the weight vector in terms of the mistaken data points only.

Let M(t) be the set of time steps up to t when mistakes were made: w_{t+1} = Σ_{j ∈ M(t)} y_j x_j. The prediction rule becomes:

ŷ = sign(w_{t+1}^T x) = sign( Σ_{j ∈ M(t)} y_j (x_j^T x) )
• When using high-dimensional features:
  ŷ = sign( Σ_{j ∈ M(t)} y_j (ϕ(x_j)^T ϕ(x)) )
Classification only depends on inner products!
21 CSE 446: Machine Learning
Why does dependence on inner products help?

• Because sometimes they can be computed much much much more efficiently.

ϕ(u) · ϕ(v) = ?

22 CSE 446: Machine Learning


From ”Foundations of Data Science” by Blum, Kannan, Hopcroft 23 CSE 446: Machine Learning Dot-product of polynomials: why useful? • Because sometimes they can be computed much much much more efficiently.

ϕ(u) · ϕ(v) = ?
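A minimal two-dimensional check of this idea for degree p = 2, using the feature map ϕ(x) = (x[1]², √2·x[1]x[2], x[2]²): the feature-space inner product equals (u·v)², computable directly in the original space (this particular ϕ is my illustration; the handwritten derivation on the slide may use a different but equivalent map):

```python
import numpy as np

def phi2(x):
    """Degree-2 monomial feature map for a 2-d input, with sqrt(2) scaling on the cross term."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

u, v = np.array([1.0, 3.0]), np.array([2.0, -0.5])
print(phi2(u) @ phi2(v))     # explicit feature-space inner product
print((u @ v) ** 2)          # same value, computed as a kernel (u . v)^2
```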

24 CSE 446: Machine Learning Kernel trick

• Allows us to do these apparently intractable calculations efficiently!

• If we can compute inner products efficiently, we can run the algorithm (e.g., the perceptron) efficiently.

• Hugely important idea in ML.

25 CSE 446: Machine Learning
(A kernel function K(x, x′) implements inner products in the feature space: K(x, x′) = ϕ(x)^T ϕ(x′).)
Kernelized perceptron
• Every time you make a mistake, remember (x_t, y_t)

• Kernelized perceptron prediction for x:

ŷ = sign( Σ_{j : mistakes} y_j K(x_j, x) )
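A minimal sketch of the kernelized perceptron described above: remember the mistaken examples and predict with sign(Σ_j y_j K(x_j, x)); function and variable names are my own:

```python
import numpy as np

def kernelized_perceptron(X, y, kernel, n_passes=10):
    """Train by remembering (x_t, y_t) every time a mistake is made."""
    mistakes = []                                   # list of (x_j, y_j) pairs
    for _ in range(n_passes):
        for xt, yt in zip(X, y):
            score = sum(yj * kernel(xj, xt) for xj, yj in mistakes)
            if np.sign(score) != yt:                # mistake: remember this example
                mistakes.append((xt, yt))
    return mistakes

def predict(mistakes, kernel, x):
    """Kernelized perceptron prediction for x."""
    return np.sign(sum(yj * kernel(xj, x) for xj, yj in mistakes))
```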

26 ©2017 Emily Fox CSE 446: Machine Learning Common kernels

• Polynomials of degree exactly p: K(u, v) = (u^T v)^p

• Polynomials of degree up to p: K(u, v) = (u^T v + 1)^p

• Gaussian (squared exponential) kernel: K(u, v) = exp(−||u − v||² / (2σ²))

• Many many many others
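The kernels above, written as plain Python functions (a sketch; the exact parameterization of the Gaussian kernel on the slide was handwritten, so the usual bandwidth form is assumed):

```python
import numpy as np

def poly_exact(u, v, p):
    """Polynomial kernel of degree exactly p."""
    return (u @ v) ** p

def poly_up_to(u, v, p):
    """Polynomial kernel of degree up to p."""
    return (u @ v + 1.0) ** p

def gaussian(u, v, sigma):
    """Gaussian (squared exponential / RBF) kernel."""
    return np.exp(-np.linalg.norm(u - v) ** 2 / (2 * sigma ** 2))
```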

27 ©2017 Emily Fox CSE 446: Machine Learning Kernelizing a machine learning algorithm

• Prove that the solution is always in the span of the training points, i.e., w = Σ_i α_i x_i
• Rewrite the algorithm so that all training or test inputs are accessed only through inner products with other inputs.

• Choose (or define) a kernel function and substitute K(xi, xj) for xi^T xj (see the sketch below).
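The three steps in miniature: if w = Σ_i α_i x_i, then w^T x = Σ_i α_i (x_i^T x), so replacing x_i^T x with K(x_i, x) kernelizes the predictor. A tiny check with a linear kernel and made-up coefficients:

```python
import numpy as np

X = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])   # training inputs
alpha = np.array([0.5, -1.0, 2.0])                    # dual coefficients (illustrative)
x = np.array([3.0, -1.0])                             # test input

w = alpha @ X                                         # primal view: w = sum_i alpha_i x_i
K = X @ x                                             # inner products x_i^T x (linear kernel)
print(w @ x, alpha @ K)                               # identical predictions
```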

28 CSE 446: Machine Learning What you need to know

• Linear separability in higher-dim feature space
• The kernel trick
• Kernelized perceptron
• Polynomial and other common kernels (will see more later)

29 ©2017 Emily Fox CSE 446: Machine Learning Support Vector Machines CSE 446: Machine Learning Slides by Emily Fox + Kevin Jamieson + others Presented by Anna Karlin

May 1, 2019

©2017 Emily Fox CSE 446: Machine Learning Linear classifiers—Which line is better?


2 ©2017 Emily Fox CSE 446: Machine Learning (maximize the minimum distance to the separating hyperplane)

3 ©2017 Emily Fox CSE 446: Machine Learning Maximizing the margin for linearly separable data

©2017 Emily Fox CSE 446: Machine Learning
Given the hyperplane w^T x + b = 0, what is the margin?


5 CSE 446: Machine Learning Given hyperplane, what is margin?
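The margin computation on these two slides was handwritten; the standard answer, which the optimization problem on the next slides builds on, is that the signed distance of a point (x_i, y_i) from the hyperplane w^T x + b = 0 is (my reconstruction):

```latex
\gamma_i = \frac{y_i\,(w^\top x_i + b)}{\lVert w\rVert_2}
```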

6 CSE 446: Machine Learning Our optimization problem

7 ©2017 Emily Fox CSE 446: Machine Learning Final version Solvable efficiently – quadratic programming problem
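The "final version" itself was written on the board; for reference, a standard hard-margin SVM formulation of this problem (my reconstruction, not copied from the deck) is the quadratic program:

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert_2^2
\quad\text{subject to}\quad
y_i\,(w^\top x_i + b) \ge 1,\qquad i = 1,\dots,n.
```

It has a quadratic objective with linear constraints, which is why it is solvable efficiently by quadratic programming.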

8 ©2017 Emily Fox CSE 446: Machine Learning What are support vectors?

9 ©2017 Emily Fox CSE 446: Machine Learning What if the data are not linearly separable?

©2017 Emily Fox CSE 446: Machine Learning What if data are not linearly separable? Use feature maps…

SVMs can be kernelized!!!!

11 ©2017 Emily Fox CSE 446: Machine Learning What if data are still not linearly separable?

Courtesy Mehryar Mohri

12 CSE 446: Machine Learning What if data are still not linearly separable?

Courtesy Mehryar Mohri

13 CSE 446: Machine Learning Final objective

14 ©2017 Emily Fox CSE 446: Machine Learning Gradient descent for SVMs

©2017 Emily Fox CSE 446: Machine Learning Minimizing regularized hinge loss (aka SVMs)

• Given a dataset:

• Minimize regularized hinge loss:

16 ©2017 Emily Fox CSE 446: Machine Learning Subgradient of hinge loss

• Hinge loss:

• Subgradient of hinge loss:

17 ©2017 Emily Fox CSE 446: Machine Learning Subgradient of hinge loss

• Hinge loss:

• Subgradient of hinge loss:

In one line:
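The loss and its subgradient on this slide were handwritten; a standard statement of the hinge loss and one valid choice of subgradient (my reconstruction) is:

```latex
\ell_i(w, b) = \max\bigl\{0,\ 1 - y_i(w^\top x_i + b)\bigr\},
\qquad
g_i =
\begin{cases}
-y_i x_i & \text{if } y_i(w^\top x_i + b) < 1,\\
0 & \text{otherwise,}
\end{cases}
\qquad g_i \in \partial_w \ell_i .
```

In one line: g_i = −1[y_i(w^T x_i + b) < 1] · y_i x_i (one valid subgradient; at the kink any point between −y_i x_i and 0 also works).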

18 ©2017 Emily Fox CSE 446: Machine Learning Subgradient descent for hinge minimization

• Given data:

• Want to minimize:

• As we’ve discussed, subgradient descent works like gradient descent: - But if there are multiple subgradients at a point, just pick (any) one:

19 CSE 446: Machine Learning SVM

• Gradient Descent Update

• SGD update
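The update rules themselves were handwritten; below is a minimal NumPy sketch of the SGD update for the regularized hinge loss objective λ||w||² + (1/n) Σ_i max(0, 1 − y_i w^T x_i), with b dropped for brevity (names, step size, and λ are my own illustrative choices):

```python
import numpy as np

def svm_sgd(X, y, lam=0.1, n_epochs=20, eta=0.01):
    """SGD on the regularized hinge loss, one subgradient step per example."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        for i in np.random.permutation(n):
            # subgradient of max(0, 1 - y_i w^T x_i) plus gradient of lam * ||w||^2
            g = 2 * lam * w
            if y[i] * (w @ X[i]) < 1:
                g = g - y[i] * X[i]
            w = w - eta * g
    return w
```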

20 ©2017 Emily Fox CSE 446: Machine Learning Machine learning problems

• Given i.i.d. data set:

Courtesy Kilian Weinberger

• Find parameters w to minimize average loss (or regularized version):
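In symbols, the generic problem this slide summarizes (my notation, consistent with the regularized hinge-loss objective used earlier; ℓ is any loss and λ ≥ 0):

```latex
\hat{w} \in \arg\min_{w}\ \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(w;\,x_i, y_i\bigr) \;+\; \lambda\,\lVert w\rVert_2^2
```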

21 CSE 446: Machine Learning What you need to know…

• Maximizing margin
• Derivation of SVM formulation
• Non-linearly separable case
  - Hinge loss
  - a.k.a. adding slack variables
• Can optimize SVMs with SGD
  - Many other approaches possible

22 ©2017 Emily Fox CSE 446: Machine Learning