
Applied CIML Chap 4 (A Geometric Approach)

Week 2: Linear Classification

“Equations are just the boring part of mathematics. I attempt to see things in terms of geometry.” ―Stephen Hawking

Professor Liang Huang (some slides from Alex Smola, CMU/Amazon)

Roadmap for Weeks 2-3

• Week 2: Linear Classifier and Perceptron
  • Part I: Brief History of the Perceptron
  • Part II: Linear Classifier and Geometry (testing time)
  • Part III: Perceptron Learning (training time)
  • Part IV: Convergence Theorem and Geometric Proof
  • Part V: Limitations of Linear Classifiers, Non-Linearity, and Feature Maps
• Week 3: Extensions of Perceptron and Practical Issues
  • Part I: My Perceptron Demo in Python
  • Part II: Voted and Averaged Perceptrons
  • Part III: MIRA and Aggressive MIRA
  • Part IV: Practical Issues and HW1
  • Part V: Perceptron vs. … (hard vs. soft)

Part I

• Brief History of the Perceptron

Perceptron (1959-now)

[Timeline figure: Frank Rosenblatt; logistic regression (1958), perceptron (1959), kernels (1964), SVM (1964; 1995), conditional random fields (2001), structured perceptron (2002), structured SVM (2003); neural networks revived around 1986 and again from 2006 on.]

Neurons

• Soma (CPU): cell body; combines signals
• Dendrite (input bus): combines the inputs from several other nerve cells
• Synapse (interface): interface and parameter store between neurons
• Axon (output cable): may be up to 1m long; transports the activation signal to neurons at different locations

Rosenblatt’s Perceptron

Multilayer Perceptron (Neural Net)

Brief History of Perceptron

[Timeline figure: 1959 Rosenblatt (invention); 1962 Novikoff (proof); 1969* Minsky/Papert (book killed it: perceptron declared DEAD); 1997 Cortes/Vapnik (soft-margin SVM: batch, max margin, +kernels); 1999 Freund/Schapire (voted/avg perceptron: revived it, handles the inseparable case); 2002 Collins (structured perceptron); 2003 Crammer/Singer (MIRA: online, conservative updates, approximates max margin); 2005* McDonald/Crammer/Pereira (structured MIRA); 2006 Singer group (aggressive MIRA); 2007--2010 Singer group (Pegasos: minibatch online, subgradient descent, +max margin). *mentioned in lectures but optional (the other papers are all covered in detail); this line of work comes from AT&T Research and ex-AT&T researchers and their students.]

Part II

• Linear Classifier and Geometry (testing time)
• decision boundary and normal vector w
• not separable through the origin: add bias b
• geometric review of linear algebra
• augmented space (no explicit bias; implicit as w0 = b)

[Diagram: at test time, input x and model w go into the Linear Classifier, which outputs the prediction σ(w · x); at training time, input x and output y go into the Perceptron Learner, which outputs the model w.]

Linear Classifier and Geometry

linear classifiers: perceptron, logistic regression, (linear) SVMs, etc.

[Figure: a linear unit with inputs x1 ... xn and weights w1 ... wn computes the output f(x) = σ(w · x); in the (x1, x2) plane, w points from the origin O into the positive region w · x > 0, the negative region is w · x < 0, the separating hyperplane is w · x = 0, and cos θ = (w · x) / (||w|| ||x||).]

• weight vector w: the “prototype” of positive examples
• w is also the normal vector of the decision boundary (the separating hyperplane w · x = 0)
• meaning of w · x: agreement with the positive direction
• test: input: x, w; output: +1 if w · x > 0, else -1 (see the sketch below)
• training: input: (x, y) pairs; output: w (the decision boundary)
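The test-time rule above is a one-liner in code. A minimal sketch in Python/NumPy (the function name and the toy weight vector are illustrative, not from the slides):

    import numpy as np

    def predict(w, x):
        # test: output +1 if w . x > 0, else -1 (w is the normal of the boundary w . x = 0)
        return 1 if np.dot(w, x) > 0 else -1

    w = np.array([2.0, 1.0])                    # "prototype" of the positive direction
    print(predict(w, np.array([1.0, 1.0])))     # w . x = 3 > 0    -> +1
    print(predict(w, np.array([-1.0, -0.5])))   # w . x = -2.5 < 0 -> -1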

What if the data is not separable through the origin? Solution: add a bias b.

[Figure: the same linear unit now computes f(x) = σ(w · x + b); the positive region is w · x + b > 0, the negative region is w · x + b < 0, the decision boundary is w · x + b = 0, and its distance from the origin O is |b| / ||w||.]

Geometric Review of Linear Algebra

a line in 2D generalizes to an (n-1)-dimensional hyperplane in n dimensions

[Figure: left, the line w1 x1 + w2 x2 + b = 0 in the (x1, x2) plane, with normal vector (w1, w2) and a point (x1*, x2*) off the line; right, the hyperplane w · x + b = 0 in n dimensions, with normal w and a point x*.]

required: the algebraic and geometric meanings of the dot product

point-to-line distance:        |w1 x1* + w2 x2* + b| / sqrt(w1² + w2²) = |(w1, w2) · (x1*, x2*) + b| / ||(w1, w2)||
point-to-hyperplane distance:  |w · x* + b| / ||w||
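To make the distance formulas concrete, a small sketch (the numbers are arbitrary, chosen so the answers are easy to check by hand):

    import numpy as np

    def distance_to_hyperplane(w, b, x):
        # point-to-hyperplane distance: |w . x + b| / ||w||
        return abs(np.dot(w, x) + b) / np.linalg.norm(w)

    # the line 3*x1 + 4*x2 - 10 = 0 in 2D
    w, b = np.array([3.0, 4.0]), -10.0
    print(distance_to_hyperplane(w, b, np.array([0.0, 0.0])))  # |b| / ||w|| = 10 / 5 = 2.0
    print(distance_to_hyperplane(w, b, np.array([2.0, 1.0])))  # the point lies on the line: 0.0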

reference: http://classes.engr.oregonstate.edu/eecs/fall2017/cs534/extra/LA-geometry.pdf

Augmented Space: dimensionality + 1

[Figure: with an explicit bias, the unit computes f(x) = σ(w · x + b) over inputs x1 ... xn with weights w1 ... wn; in the augmented space we add the constant input x0 = 1 with weight w0 = b, so the unit computes f(x) = σ((b; w) · (1; x)) with no explicit bias.]

• a 1D dataset that can’t be separated by a boundary through the origin becomes separable through the origin in 2D after augmentation
• likewise, a 2D dataset that can’t be separated through the origin becomes separable through the origin in 3D
• the augmentation trick: prepend x0 = 1 to every x and fold the bias into the weight vector, w' = (b; w) (see the sketch below)
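In code, the augmentation is a one-line transformation of the data; a hedged sketch (the names are mine, not the course demo):

    import numpy as np

    def augment(x):
        # map x -> (1; x): the bias becomes an ordinary weight w0 = b
        return np.concatenate(([1.0], x))

    # the explicit-bias score w . x + b equals the augmented score (b; w) . (1; x)
    w, b = np.array([2.0, -1.0]), 0.5
    x = np.array([3.0, 4.0])
    w_aug = np.concatenate(([b], w))   # (b; w)
    assert np.isclose(np.dot(w, x) + b, np.dot(w_aug, augment(x)))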

Part III

• The Perceptron Learning Algorithm (training time)
• the version without bias (augmented space)
• side note on mathematical notations
• mini-demo

[Diagram repeated: test time applies the model w to an input x to produce the prediction σ(w · x); training time feeds (x, y) pairs to the Perceptron Learner to produce the model w.]

Perceptron

[Figure: classifying emails as Ham vs. Spam]

The Perceptron Algorithm

input: training data D
output: weights w

initialize w ← 0
while not converged
    for (x, y) ∈ D
        if y(w · x) ≤ 0
            w ← w + yx

• the simplest machine learning algorithm
• keep cycling through the training data
• update w if there is a mistake on example (x, y)
• until all examples are classified correctly

Side Note on Mathematical Notations

• I’ll try my best to be consistent in notations
• e.g., bold-face for vectors, italic for scalars, etc.
• avoid unnecessary superscripts and subscripts by using a “Pythonic” rather than a “C” notational style
• most textbooks have consistent but bad notations

good notation (consistent, Pythonic style):

initialize w ← 0
while not converged
    for (x, y) ∈ D
        if y(w · x) ≤ 0
            w ← w + yx

bad notation (inconsistent, with unnecessary i and b):

initialize w = 0 and b = 0
repeat
    if y_i [⟨w, x_i⟩ + b] ≤ 0 then
        w ← w + y_i x_i and b ← b + y_i
    end if
until all classified correctly
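The pseudocode translates almost line for line into Python. The following is a minimal sketch, not the instructor's demo code; it assumes augmented (bias-folded) feature vectors, labels y in {+1, -1}, and an epoch cap in case the data turns out not to be separable:

    import numpy as np

    def perceptron_train(data, max_epochs=100):
        # data: list of (x, y) pairs, x a NumPy vector (already augmented), y in {+1, -1}
        w = np.zeros(len(data[0][0]))          # initialize w <- 0
        for _ in range(max_epochs):            # keep cycling through the training data
            mistakes = 0
            for x, y in data:
                if y * np.dot(w, x) <= 0:      # mistake (or exactly on the boundary)
                    w += y * x                 # update: w <- w + y x
                    mistakes += 1
            if mistakes == 0:                  # converged: all examples classified correctly
                break
        return w

    # tiny sanity check on a separable 2D dataset (augmented with x0 = 1)
    data = [(np.array([1., 2., 2.]), +1),
            (np.array([1., -1., -2.]), -1),
            (np.array([1., 1., -3.]), -1)]
    w = perceptron_train(data)
    assert all(y * np.dot(w, x) > 0 for x, y in data)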

Demo

[Demo figures (bias = 0): the algorithm repeatedly picks a misclassified example x and applies w ← w + yx, which rotates the weight vector w (and with it the decision boundary) toward classifying x correctly, until every example is classified correctly.]

Part IV

• Linear Separation, Convergence Theorem and Proof
• formal definition of linear separation
• perceptron convergence theorem
• geometric proof
• what variables affect the convergence bound?

Linear Separation; Convergence Theorem

• dataset D is said to be “linearly separable” if there exists some unit oracle vector u (||u|| = 1) which correctly classifies every example (x, y) with a margin of at least δ:

      y(u · x) ≥ δ  for all (x, y) ∈ D

• then the perceptron must converge to a linear separator after at most R²/δ² mistakes (updates), where R = max over (x, y) ∈ D of ||x||
• the convergence bound R²/δ² is:
  • dimensionality independent
  • dataset size independent
  • order independent (although order matters for which w is output)
  • it scales with the ‘difficulty’ of the problem

Geometric Proof, part 1

• part 1: progress (alignment) on the oracle projection
• assume w(0) = 0, and let w(i) be the weight vector before the ith update, which is made on some example (x, y):

      w(i+1) = w(i) + yx
      u · w(i+1) = u · w(i) + y(u · x)
      u · w(i+1) ≥ u · w(i) + δ          since y(u · x) ≥ δ for all (x, y) ∈ D
      u · w(i+1) ≥ iδ                    after i updates, starting from w(0) = 0

• the projection of w onto u increases with every update: more agreement with the oracle direction

Geometric Proof, part 2

• part 2: upper bound on the norm of the weight vector

      w(i+1) = w(i) + yx
      ||w(i+1)||² = ||w(i) + yx||² = ||w(i)||² + ||x||² + 2y(w(i) · x)
                  ≤ ||w(i)||² + R²       a mistake on x means y(w(i) · x) ≤ 0 (the angle θ between w(i) and yx is ≥ 90°, so cos θ ≤ 0), and R = max ||x||
                  ≤ iR²                  after i updates

• combine with part 1:

      iδ ≤ u · w(i+1) ≤ ||u|| ||w(i+1)|| = ||w(i+1)|| ≤ √i R

  so i ≤ R²/δ²

Convergence Bound

R²/δ², where R = max over (x, y) ∈ D of ||x||

• is independent of:
  • dimensionality
  • number of examples
  • order of examples
• and depends on:
  • separation difficulty (margin δ): a narrow margin is hard to separate, a wide margin is easy to separate
  • feature scale (radius R)
• the initial weight w(0) changes how fast it converges, but not whether it will converge
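The bound can also be checked empirically. A sketch (not from the slides) that builds a separable dataset with a known unit oracle u, counts perceptron updates, and compares the count against R²/δ²:

    import numpy as np

    rng = np.random.default_rng(0)
    u = np.array([0.6, 0.8])                    # unit oracle vector, ||u|| = 1
    X = rng.uniform(-1, 1, size=(200, 2))
    X = X[np.abs(X @ u) > 0.1]                  # enforce a margin around the boundary u . x = 0
    y = np.where(X @ u > 0, 1, -1)

    R = np.max(np.linalg.norm(X, axis=1))       # radius R = max ||x||
    delta = np.min(y * (X @ u))                 # achieved margin delta

    w, mistakes, converged = np.zeros(2), 0, False
    while not converged:                        # guaranteed to terminate: the data is separable
        converged = True
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:
                w += y_i * x_i
                mistakes += 1
                converged = False

    print(mistakes, "updates; bound R^2 / delta^2 =", (R / delta) ** 2)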

Part V

• Limitations of Linear Classifiers and Feature Maps
• XOR: not linearly separable
• perceptron cycling theorem
• solving XOR: non-linear feature map
• “preview demo”: SVM with non-linear kernel
• redefining “linear” separation under feature map

XOR

• XOR is not linearly separable
• non-linear separation is trivial
• caveat from “Perceptrons” (Minsky & Papert, 1969): finding the minimum-error linear separator is NP-hard (this killed neural networks in the 70s)

Brief History of Perceptron

[Same timeline figure as in the earlier “Brief History of Perceptron” slide.]

What if the data is not separable?

• in practice, data is almost always inseparable
• wait, what exactly does that mean?
• perceptron cycling theorem (1970): the weights will remain bounded and will not diverge
• use a dev set for early stopping (prevents overfitting; see the sketch after this list)
• non-linearity (inseparable in low dimensions => separable in high dimensions)
  • higher-order features by combining atomic ones (cf. XOR)
  • a more systematic way: kernels (more details in week 5)
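For the early-stopping bullet above, here is one way it could look in code; this is an illustrative sketch under the assumption that we evaluate on the dev set after every epoch and keep the best weights (it is not the course implementation):

    import numpy as np

    def perceptron_early_stop(train, dev, max_epochs=20):
        # train/dev: lists of (x, y) pairs; returns the weights with the best dev accuracy
        w = np.zeros(len(train[0][0]))
        best_w, best_acc = w.copy(), 0.0
        for _ in range(max_epochs):
            for x, y in train:
                if y * np.dot(w, x) <= 0:
                    w += y * x
            acc = np.mean([y * np.dot(w, x) > 0 for x, y in dev])
            if acc > best_acc:                 # keep the epoch that generalizes best
                best_acc, best_w = acc, w.copy()
        return best_w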

Solving XOR: Non-Linear Feature Map

ϕ: (x1, x2) → (x1, x2, x1 x2)

• XOR is not linearly separable
• mapping into 3D makes it easily linearly separable
• this mapping is actually non-linear (quadratic feature x1 x2)
• a special case of “… kernels” (week 5)
• a linear decision boundary in 3D => non-linear boundaries in 2D
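A tiny sketch of this feature map: with the four XOR points encoded as +/-1 (an encoding chosen here for convenience, not taken from the slides), a weight on the product feature alone separates the mapped data:

    import numpy as np

    def phi(x):
        # quadratic feature map: (x1, x2) -> (x1, x2, x1*x2)
        return np.array([x[0], x[1], x[0] * x[1]])

    # XOR with +/-1 encoding: the label is +1 exactly when the two inputs differ
    points = [np.array([1., 1.]), np.array([1., -1.]), np.array([-1., 1.]), np.array([-1., -1.])]
    labels = [-1, +1, +1, -1]

    w = np.array([0., 0., -1.])   # weight only on the product feature x1*x2
    assert all(y * np.dot(w, phi(x)) > 0 for x, y in zip(points, labels))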

Low-dimension <=> High-dimension

[Figures: the dataset is not linearly separable in 2D; after the feature map it is linearly separable in 3D; the linear decision boundary in 3D corresponds to non-linear boundaries back in 2D.]

Linear Separation under Feature Map

• we have to redefine separation and the convergence theorem
• dataset D is said to be linearly separable under a feature map ϕ if there exists some unit oracle vector u (||u|| = 1) which correctly classifies every example (x, y) with a margin of at least δ:

      y(u · ϕ(x)) ≥ δ  for all (x, y) ∈ D

• then the perceptron must converge to a linear separator after at most R²/δ² mistakes (updates), where R = max over (x, y) ∈ D of ||ϕ(x)||
• in practice, the choice of feature map (“feature engineering”) is often more important than the choice of learning algorithm
• the first step of any machine learning project is data preprocessing: transform each (x, y) to (ϕ(x), y)
• at testing time, also transform each x to ϕ(x) (see the sketch after this list)

• deep learning aims to automate feature engineering
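Putting the preprocessing advice in one place: apply ϕ to every training example before learning, and apply the same ϕ at test time. A self-contained sketch (the particular feature map and the fixed epoch count are assumptions for illustration):

    import numpy as np

    def phi(x):
        # assumed feature map: a constant bias feature plus the quadratic feature x1*x2
        return np.array([1.0, x[0], x[1], x[0] * x[1]])

    def perceptron_train(data, epochs=50):
        w = np.zeros(len(data[0][0]))
        for _ in range(epochs):
            for x, y in data:
                if y * np.dot(w, x) <= 0:
                    w += y * x
        return w

    # training: transform each (x, y) to (phi(x), y), then learn w in the mapped space
    raw = [(np.array([1., 1.]), -1), (np.array([1., -1.]), +1),
           (np.array([-1., 1.]), +1), (np.array([-1., -1.]), -1)]   # XOR-style toy data
    w = perceptron_train([(phi(x), y) for x, y in raw])

    # testing: transform the new x with the same phi before scoring
    x_new = np.array([0.5, -2.0])
    print(1 if np.dot(w, phi(x_new)) > 0 else -1)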