
Applied Machine Learning, CIML Chap. 4 (A Geometric Approach)

"Equations are just the boring part of mathematics. I attempt to see things in terms of geometry." ―Stephen Hawking

Week 2: Linear Classification: Perceptron
Professor Liang Huang (some slides from Alex Smola, CMU/Amazon)

Roadmap for Weeks 2-3

• Week 2: Linear Classifier and Perceptron
  • Part I: Brief History of the Perceptron
  • Part II: Linear Classifier and Geometry (testing time)
  • Part III: Perceptron Learning Algorithm (training time)
  • Part IV: Convergence Theorem and Geometric Proof
  • Part V: Limitations of Linear Classifiers, Non-Linearity, and Feature Maps
• Week 3: Extensions of Perceptron and Practical Issues
  • Part I: My Perceptron Demo in Python
  • Part II: Voted and Averaged Perceptrons
  • Part III: MIRA and Aggressive MIRA
  • Part IV: Practical Issues and HW1
  • Part V: Perceptron vs. Logistic Regression (hard vs. soft); Gradient Descent

Part I • Brief History of the Perceptron

Perceptron (1959-now)

• the perceptron (Frank Rosenblatt, 1959) and its relatives: logistic regression (1958), SVM (1964; 1995) with kernels (1964), the multilayer perceptron (~1986; deep learning, 2006-now), conditional random fields (2001), the structured perceptron (2002), and the structured SVM (2003)

Neurons

• Soma (CPU): the cell body; combines the incoming signals
• Dendrite (input bus): collects the inputs from several other nerve cells
• Synapse (interface): interface and parameter store between neurons
• Axon (output cable): may be up to 1m long; transports the activation signal to neurons at other locations

Frank Rosenblatt's Perceptron

Multilayer Perceptron (Neural Net)

Brief History of Perceptron

• 1959 Rosenblatt: invention
• 1962 Novikoff: convergence proof
• 1969* Minsky/Papert: the book "Perceptrons" killed it (the perceptron was considered dead until 1999)
• 1997 Cortes/Vapnik: soft-margin SVM (batch, max margin, +kernels)
• 1999 Freund/Schapire: voted/averaged perceptron revived it (online, approx. max margin, handles the inseparable case)
• 2002 Collins: structured perceptron
• 2003 Crammer/Singer: MIRA (conservative updates)
• 2005* McDonald/Crammer/Pereira: structured MIRA
• 2006 Singer group: aggressive MIRA
• 2007-2010 Singer group: Pegasos (subgradient descent, minibatch/online, max margin)
  *mentioned in lectures but optional (the other papers are all covered in detail); much of this line of work came from AT&T Research, ex-AT&T researchers, and their students

Part II • Linear Classifier and Geometry (testing time)

• decision boundary and normal vector w
• not separable through the origin: add bias b
• geometric review of linear algebra
• augmented space (no explicit bias; implicit as w0 = b)

Test time:  input x → linear classifier (model w) → prediction σ(w·x)
Training time:  input x and output y → perceptron learner → model w

Linear Classifier and Geometry

• linear classifiers: perceptron, logistic regression, (linear) SVMs, etc.
• the classifier computes f(x) = σ(w·x), where cos θ = (w·x) / (‖w‖ ‖x‖)
• weight vector w: the "prototype" of positive examples; it is also the normal vector of the decision boundary, i.e., the separating hyperplane w·x = 0
• meaning of w·x: agreement with the positive direction
• negative side: w·x < 0; positive side: w·x > 0
• test time: input x and w; output +1 if w·x > 0, else -1
• training time: input (x, y) pairs; output w (the decision boundary)
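The test rule above is easy to state in code. Below is a minimal sketch (not from the slides), assuming NumPy arrays for w and x; ties at exactly 0 are broken toward the negative class, matching the "+1 if w·x > 0, else -1" rule.

    import numpy as np

    def predict(w, x):
        # test-time rule: +1 if w.x > 0, else -1
        return 1 if np.dot(w, x) > 0 else -1

    # toy example: w is the "prototype" of the positive class
    w = np.array([1.0, 2.0])
    print(predict(w, np.array([0.5, 1.0])))     # +1: same side as w
    print(predict(w, np.array([-1.0, -0.5])))   # -1: opposite side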
What If the Data Is Not Separable Through the Origin?

• solution: add a bias b, so that f(x) = σ(w·x + b)
• negative side: w·x + b < 0; positive side: w·x + b > 0; decision boundary: w·x + b = 0
• the boundary's distance from the origin is |b| / ‖w‖

Geometric Review of Linear Algebra

• a line in 2D: w1·x1 + w2·x2 + b = 0; in general, an (n-1)-dimensional hyperplane in n dimensions: w·x + b = 0
• required: the algebraic and geometric meanings of the dot product
• point-to-line distance from a point (x1*, x2*): |w1·x1* + w2·x2* + b| / √(w1² + w2²); in general, the point-to-hyperplane distance is |w·x* + b| / ‖w‖
• see: http://classes.engr.oregonstate.edu/eecs/fall2017/cs534/extra/LA-geometry.pdf

Augmented Space: Dimensionality + 1

• explicit bias: f(x) = σ(w·x + b); e.g., 1D data that can't be separated through the origin
• augmented space: x' = (1; x) (i.e., x0 = 1) and w' = (b; w) (i.e., w0 = b), so f(x) = σ((b; w)·(1; x)); the same data can now be separated through the origin in 2D
• likewise, 2D data that can't be separated through the origin becomes separable through the origin in 3D

Part III • The Perceptron Learning Algorithm (training time)

• the version without bias (augmented space)
• side note on mathematical notations
• mini-demo

Perceptron example: classifying email as Ham vs. Spam

The Perceptron Algorithm

    input: training data D
    output: weights w
    initialize w ← 0
    while not converged
        for (x, y) ∈ D
            if y(w·x) ≤ 0
                w ← w + y·x

• the simplest machine learning algorithm
• keep cycling through the training data
• update w if there is a mistake on example (x, y)
• until all examples are classified correctly
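A minimal runnable version of the pseudocode above, as a sketch rather than the course's official demo (that comes in Week 3). It works in augmented space, so the first component of every x is a constant-1 feature standing in for the bias; the toy dataset and the epoch cap are illustrative assumptions.

    import numpy as np

    def perceptron_train(D, max_epochs=100):
        # perceptron in augmented space (no explicit bias):
        # on a mistake, w <- w + y*x
        w = np.zeros(len(D[0][0]))
        for _ in range(max_epochs):        # cap the passes in case D is not separable
            converged = True
            for x, y in D:
                if y * np.dot(w, x) <= 0:  # mistake (or exactly on the boundary)
                    w += y * x
                    converged = False
            if converged:
                break
        return w

    # toy dataset; the leading 1 in each x is the augmented bias feature x0 = 1
    D = [(np.array([1.0,  2.0,  1.0]), +1),
         (np.array([1.0,  1.5,  2.0]), +1),
         (np.array([1.0, -1.0, -0.5]), -1),
         (np.array([1.0, -2.0, -1.0]), -1)]
    w = perceptron_train(D)
    print(w, [1 if np.dot(w, x) > 0 else -1 for x, _ in D])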
Side Note on Mathematical Notations

• I'll try my best to be consistent in notations
• e.g., bold-face for vectors, italic for scalars, etc.
• avoid unnecessary superscripts and subscripts by using a "Pythonic" rather than a "C" notational style
• most textbooks have consistent but bad notations

  good notations (consistent, Pythonic style):
      initialize w ← 0
      while not converged
          for (x, y) ∈ D
              if y(w·x) ≤ 0
                  w ← w + y·x

  bad notations (inconsistent, with unnecessary i and b):
      initialize w = 0 and b = 0
      repeat
          if yi[⟨w, xi⟩ + b] ≤ 0 then
              w ← w + yi·xi and b ← b + yi
          end if
      until all classified correctly

Demo

• the algorithm above is run step by step on a 2D toy example (bias = 0): each mistaken example x triggers the update w ← w + y·x, which rotates the decision boundary, until all examples are classified correctly

Part IV • Linear Separation, Convergence Theorem and Proof

• formal definition of linear separation
• perceptron convergence theorem
• geometric proof
• what variables affect the convergence bound?

Linear Separation; Convergence Theorem

• a dataset D is said to be "linearly separable" if there exists some unit oracle vector u (‖u‖ = 1) which correctly classifies every example (x, y) with a margin of at least δ:  y(u·x) ≥ δ for all (x, y) ∈ D
• then the perceptron must converge to a linear separator after at most R²/δ² mistakes (updates), where R = max over (x, y) ∈ D of ‖x‖
• the convergence rate R²/δ²:
  • is independent of the dimensionality
  • is independent of the dataset size
  • is independent of the order of examples (though order matters for the output)
  • scales with the "difficulty" of the problem

Geometric Proof, part 1

• part 1: progress (alignment) on the oracle projection
• assume w(0) = 0, and let w(i) be the weight vector before the i-th update (on example (x, y)); the update is w(i+1) = w(i) + y·x
• then u·w(i+1) = u·w(i) + y(u·x) ≥ u·w(i) + δ, since y(u·x) ≥ δ for all (x, y) ∈ D
• after i updates, u·w(i+1) ≥ iδ: the projection on u keeps increasing (more agreement with the oracle direction)
• since ‖u‖ = 1, this gives ‖w(i+1)‖ ≥ u·w(i+1) ≥ iδ

Geometric Proof, part 2

• part 2: upper bound on the norm of the weight vector
• ‖w(i+1)‖² = ‖w(i) + y·x‖² = ‖w(i)‖² + ‖x‖² + 2y(w(i)·x) ≤ ‖w(i)‖² + R², because the update was a mistake on x (so y(w(i)·x) ≤ 0, i.e., cos θ ≤ 0, θ ≥ 90°) and ‖x‖ ≤ R = max over (x, y) ∈ D of ‖x‖
• after i updates, ‖w(i+1)‖² ≤ iR², i.e., ‖w(i+1)‖ ≤ √i·R
• combining with part 1:  iδ ≤ ‖w(i+1)‖ ≤ √i·R, hence i ≤ R²/δ²

Convergence Bound R²/δ², where R = max ‖x‖

• the bound is independent of:
  • the dimensionality
  • the number of examples
  • the order of examples
  • a constant learning rate
• and depends on:
  • the separation difficulty (margin δ): a narrow margin is hard to separate, a wide margin is easy
  • the feature scale (radius R)
• the initial weight w(0) changes how fast the algorithm converges, but not whether it converges
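A small sanity check of the theorem, as a sketch (not from the slides): on a toy separable dataset we can compute R directly, compute a margin δ with respect to a hand-picked unit vector u (an illustrative assumption; any unit vector with positive margin gives a valid bound), and confirm that the number of perceptron mistakes stays below R²/δ².

    import numpy as np

    # toy 2D dataset, linearly separable through the origin
    D = [(np.array([ 2.0,  1.0]), +1),
         (np.array([ 1.0,  2.0]), +1),
         (np.array([-1.5, -1.0]), -1),
         (np.array([-1.0, -2.5]), -1)]

    # hand-picked unit oracle vector u (illustrative only)
    u = np.array([1.0, 1.0]) / np.sqrt(2.0)
    delta = min(y * np.dot(u, x) for x, y in D)   # margin of D w.r.t. u
    R = max(np.linalg.norm(x) for x, _ in D)      # radius of the data

    # run the perceptron and count mistakes (= updates)
    w, mistakes = np.zeros(2), 0
    converged = False
    while not converged:
        converged = True
        for x, y in D:
            if y * np.dot(w, x) <= 0:
                w += y * x
                mistakes += 1
                converged = False

    print(f"mistakes = {mistakes}, bound R^2/delta^2 = {R**2 / delta**2:.2f}")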
Part V • Limitations of Linear Classifiers and Feature Maps

• XOR: not linearly separable
• perceptron cycling theorem
• solving XOR: non-linear feature map
• "preview demo": SVM with a non-linear kernel
• redefining "linear" separation under a feature map

XOR

• XOR is not linearly separable
• but nonlinear separation is trivial, e.g., with a simple feature map (a sketch follows below)
• caveat from "Perceptrons" (Minsky & Papert, 1969): finding the minimum-error linear separator is NP-hard (this killed neural networks in the 70s)
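A sketch of the feature-map fix for XOR, under the illustrative choice of adding the product feature x1·x2 (the slides only state that some non-linear feature map works). In the mapped 3D space the same perceptron loop from Part III converges, even though no linear separator exists in the original 2D space.

    import numpy as np

    def feature_map(x):
        # map (x1, x2) to (x1, x2, x1*x2); the product feature makes XOR separable
        return np.array([x[0], x[1], x[0] * x[1]])

    # XOR over {-1,+1}^2: label is +1 exactly when the two inputs disagree
    raw = [((+1, +1), -1), ((+1, -1), +1), ((-1, +1), +1), ((-1, -1), -1)]
    D = [(feature_map(np.array(x, dtype=float)), y) for x, y in raw]

    # the same perceptron loop as before, now in the mapped 3D space
    w = np.zeros(3)
    converged = False
    while not converged:
        converged = True
        for phi, y in D:
            if y * np.dot(w, phi) <= 0:
                w += y * phi
                converged = False

    print("w =", w)  # the weight on the x1*x2 feature does all the work
    print([1 if np.dot(w, phi) > 0 else -1 for phi, _ in D])  # matches the labels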