
Applied CIML Chap 4 (A Geometric Approach)

Week 2: Linear Classification

“Equations are just the boring part of mathematics. I attempt to see things in terms of geometry.” ―Stephen Hawking

Professor Liang Huang (some slides from Alex Smola, CMU/Amazon)

Roadmap for Weeks 2-3

• Week 2: Linear Classifier and Perceptron
  • Part I: Brief History of the Perceptron
  • Part II: Linear Classifier and Geometry (testing time)
  • Part III: Perceptron Learning (training time)
  • Part IV: Convergence Theorem and Geometric Proof
  • Part V: Limitations of Linear Classifiers, Non-Linearity, and Feature Maps
• Week 3: Extensions of Perceptron and Practical Issues
  • Part I: My Perceptron Demo in Python
  • Part II: Voted and Averaged Perceptrons
  • Part III: MIRA and Aggressive MIRA
  • Part IV: Practical Issues and HW1
  • Part V: Perceptron vs. … (hard vs. soft)

Part I

• Brief History of the Perceptron

Perceptron (1959-now)

[Timeline figure: Frank Rosenblatt; logistic regression (1958), perceptron (1959), kernels (1964), SVM (1964; 1995), conditional random fields (2001), structured perceptron (2002), structured SVM (2003); neural networks revived around 1986 and again from 2006 on.]

Neurons

• Soma (CPU): cell body; combines signals
• Dendrite (input bus): combines the inputs from several other nerve cells
• Synapse (interface): interface and parameter store between neurons
• Axon (output cable): may be up to 1m long; transports the activation signal to neurons at different locations

Rosenblatt’s Perceptron

Multilayer Perceptron (Neural Net)

Brief History of Perceptron

[Timeline figure: 1959 Rosenblatt (invention); 1962 Novikoff (proof); 1969* Minsky/Papert (book killed it: perceptron declared DEAD); 1997 Cortes/Vapnik (soft-margin SVM: batch, max margin, +kernels); 1999 Freund/Schapire (voted/avg perceptron: revived it, handles the inseparable case); 2002 Collins (structured perceptron); 2003 Crammer/Singer (MIRA: online, conservative updates, approximates max margin); 2005* McDonald/Crammer/Pereira (structured MIRA); 2006 Singer group (aggressive MIRA); 2007--2010 Singer group (Pegasos: minibatch online, subgradient descent, +max margin). *mentioned in lectures but optional (the other papers are all covered in detail); this line of work comes from AT&T Research and ex-AT&T researchers and their students.]

Part II

• Linear Classifier and Geometry (testing time)
• decision boundary and normal vector w
• not separable through the origin: add bias b
• geometric review of linear algebra
• augmented space (no explicit bias; implicit as w0 = b)

[Diagram: at test time, input x and model w go into the Linear Classifier, which outputs the prediction σ(w · x); at training time, input x and output y go into the Perceptron Learner, which outputs the model w.]

Linear Classifier and Geometry

linear classifiers: perceptron, logistic regression, (linear) SVMs, etc.

[Figure: a linear unit with inputs x1 ... xn and weights w1 ... wn computes the output f(x) = σ(w · x); in the (x1, x2) plane, w points from the origin O into the positive region w · x > 0, the negative region is w · x < 0, the separating hyperplane is w · x = 0, and cos θ = (w · x) / (||w|| ||x||).]

• weight vector w: the “prototype” of positive examples
• w is also the normal vector of the decision boundary (the separating hyperplane w · x = 0)
• meaning of w · x: agreement with the positive direction
• test: input: x, w; output: +1 if w · x > 0, else -1 (see the sketch below)
• training: input: (x, y) pairs; output: w (the decision boundary)
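The test-time rule above is a one-liner in code. A minimal sketch in Python/NumPy (the function name and the toy weight vector are illustrative, not from the slides):

    import numpy as np

    def predict(w, x):
        # test: output +1 if w . x > 0, else -1 (w is the normal of the boundary w . x = 0)
        return 1 if np.dot(w, x) > 0 else -1

    w = np.array([2.0, 1.0])                    # "prototype" of the positive direction
    print(predict(w, np.array([1.0, 1.0])))     # w . x = 3 > 0    -> +1
    print(predict(w, np.array([-1.0, -0.5])))   # w . x = -2.5 < 0 -> -1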

What if the data is not separable through the origin? Solution: add a bias b.

[Figure: the same linear unit now computes f(x) = σ(w · x + b); the positive region is w · x + b > 0, the negative region is w · x + b < 0, the decision boundary is w · x + b = 0, and its distance from the origin O is |b| / ||w||.]

Geometric Review of Linear Algebra

a line in 2D generalizes to an (n-1)-dimensional hyperplane in n dimensions

[Figure: left, the line w1 x1 + w2 x2 + b = 0 in the (x1, x2) plane, with normal vector (w1, w2) and a point (x1*, x2*) off the line; right, the hyperplane w · x + b = 0 in n dimensions, with normal w and a point x*.]

required: the algebraic and geometric meanings of the dot product

point-to-line distance:        |w1 x1* + w2 x2* + b| / sqrt(w1² + w2²) = |(w1, w2) · (x1*, x2*) + b| / ||(w1, w2)||
point-to-hyperplane distance:  |w · x* + b| / ||w||
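To make the distance formulas concrete, a small sketch (the numbers are arbitrary, chosen so the answers are easy to check by hand):

    import numpy as np

    def distance_to_hyperplane(w, b, x):
        # point-to-hyperplane distance: |w . x + b| / ||w||
        return abs(np.dot(w, x) + b) / np.linalg.norm(w)

    # the line 3*x1 + 4*x2 - 10 = 0 in 2D
    w, b = np.array([3.0, 4.0]), -10.0
    print(distance_to_hyperplane(w, b, np.array([0.0, 0.0])))  # |b| / ||w|| = 10 / 5 = 2.0
    print(distance_to_hyperplane(w, b, np.array([2.0, 1.0])))  # the point lies on the line: 0.0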

reference: http://classes.engr.oregonstate.edu/eecs/fall2017/cs534/extra/LA-geometry.pdf

Augmented Space: dimensionality + 1

[Figure: with an explicit bias, the unit computes f(x) = σ(w · x + b) over inputs x1 ... xn with weights w1 ... wn; in the augmented space we add the constant input x0 = 1 with weight w0 = b, so the unit computes f(x) = σ((b; w) · (1; x)) with no explicit bias.]

• a 1D dataset that can’t be separated by a boundary through the origin becomes separable through the origin in 2D after augmentation
• likewise, a 2D dataset that can’t be separated through the origin becomes separable through the origin in 3D
• the augmentation trick: prepend x0 = 1 to every x and fold the bias into the weight vector, w' = (b; w) (see the sketch below)
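In code, the augmentation is a one-line transformation of the data; a hedged sketch (the names are mine, not the course demo):

    import numpy as np

    def augment(x):
        # map x -> (1; x): the bias becomes an ordinary weight w0 = b
        return np.concatenate(([1.0], x))

    # the explicit-bias score w . x + b equals the augmented score (b; w) . (1; x)
    w, b = np.array([2.0, -1.0]), 0.5
    x = np.array([3.0, 4.0])
    w_aug = np.concatenate(([b], w))   # (b; w)
    assert np.isclose(np.dot(w, x) + b, np.dot(w_aug, augment(x)))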

Part III

• The Perceptron Learning Algorithm (training time)
• the version without bias (augmented space)
• side note on mathematical notations
• mini-demo

[Diagram repeated: test time applies the model w to an input x to produce the prediction σ(w · x); training time feeds (x, y) pairs to the Perceptron Learner to produce the model w.]

Perceptron

[Figure: classifying emails as Ham vs. Spam]

The Perceptron Algorithm

input: training data D
output: weights w

initialize w ← 0
while not converged
    for (x, y) ∈ D
        if y(w · x) ≤ 0
            w ← w + yx

• the simplest machine learning algorithm
• keep cycling through the training data
• update w if there is a mistake on example (x, y)
• until all examples are classified correctly

Side Note on Mathematical Notations

• I’ll try my best to be consistent in notations
• e.g., bold-face for vectors, italic for scalars, etc.
• avoid unnecessary superscripts and subscripts by using a “Pythonic” rather than a “C” notational style
• most textbooks have consistent but bad notations

good notation (consistent, Pythonic style):

initialize w ← 0
while not converged
    for (x, y) ∈ D
        if y(w · x) ≤ 0
            w ← w + yx

bad notation (inconsistent, with unnecessary i and b):

initialize w = 0 and b = 0
repeat
    if y_i [⟨w, x_i⟩ + b] ≤ 0 then
        w ← w + y_i x_i and b ← b + y_i
    end if
until all classified correctly
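The pseudocode translates almost line for line into Python. The following is a minimal sketch, not the instructor's demo code; it assumes augmented (bias-folded) feature vectors, labels y in {+1, -1}, and an epoch cap in case the data turns out not to be separable:

    import numpy as np

    def perceptron_train(data, max_epochs=100):
        # data: list of (x, y) pairs, x a NumPy vector (already augmented), y in {+1, -1}
        w = np.zeros(len(data[0][0]))          # initialize w <- 0
        for _ in range(max_epochs):            # keep cycling through the training data
            mistakes = 0
            for x, y in data:
                if y * np.dot(w, x) <= 0:      # mistake (or exactly on the boundary)
                    w += y * x                 # update: w <- w + y x
                    mistakes += 1
            if mistakes == 0:                  # converged: all examples classified correctly
                break
        return w

    # tiny sanity check on a separable 2D dataset (augmented with x0 = 1)
    data = [(np.array([1., 2., 2.]), +1),
            (np.array([1., -1., -2.]), -1),
            (np.array([1., 1., -3.]), -1)]
    w = perceptron_train(data)
    assert all(y * np.dot(w, x) > 0 for x, y in data)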

Demo

[Demo figures (bias = 0): the algorithm repeatedly picks a misclassified example x and applies w ← w + yx, which rotates the weight vector w (and with it the decision boundary) toward classifying x correctly, until every example is classified correctly.]

Part IV

• Linear Separation, Convergence Theorem and Proof
• formal definition of linear separation
• perceptron convergence theorem
• geometric proof
• what variables affect the convergence bound?

Linear Separation; Convergence Theorem

• dataset D is said to be “linearly separable” if there exists some unit oracle vector u (||u|| = 1) which correctly classifies every example (x, y) with a margin of at least δ:

      y(u · x) ≥ δ  for all (x, y) ∈ D

• then the perceptron must converge to a linear separator after at most R²/δ² mistakes (updates), where R = max over (x, y) ∈ D of ||x||
• the convergence bound R²/δ² is:
  • dimensionality independent
  • dataset size independent
  • order independent (although order matters for which w is output)
  • it scales with the ‘difficulty’ of the problem

Geometric Proof, part 1

• part 1: progress (alignment) on the oracle projection
• assume w(0) = 0, and let w(i) be the weight vector before the ith update, which is made on some example (x, y):

      w(i+1) = w(i) + yx
      u · w(i+1) = u · w(i) + y(u · x)
      u · w(i+1) ≥ u · w(i) + δ          since y(u · x) ≥ δ for all (x, y) ∈ D
      u · w(i+1) ≥ iδ                    after i updates, starting from w(0) = 0

• the projection of w onto u increases with every update: more agreement with the oracle direction

Geometric Proof, part 2

• part 2: upper bound on the norm of the weight vector

      w(i+1) = w(i) + yx
      ||w(i+1)||² = ||w(i) + yx||² = ||w(i)||² + ||x||² + 2y(w(i) · x)
                  ≤ ||w(i)||² + R²       a mistake on x means y(w(i) · x) ≤ 0 (the angle θ between w(i) and yx is ≥ 90°, so cos θ ≤ 0), and R = max ||x||
                  ≤ iR²                  after i updates

• combine with part 1:

      iδ ≤ u · w(i+1) ≤ ||u|| ||w(i+1)|| = ||w(i+1)|| ≤ √i R

  so i ≤ R²/δ²

Convergence Bound

R²/δ², where R = max over (x, y) ∈ D of ||x||

• is independent of:
  • dimensionality
  • number of examples
  • order of examples
• and depends on:
  • separation difficulty (margin δ): a narrow margin is hard to separate, a wide margin is easy to separate
  • feature scale (radius R)
• the initial weight w(0) changes how fast it converges, but not whether it will converge
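The bound can also be checked empirically. A sketch (not from the slides) that builds a separable dataset with a known unit oracle u, counts perceptron updates, and compares the count against R²/δ²:

    import numpy as np

    rng = np.random.default_rng(0)
    u = np.array([0.6, 0.8])                    # unit oracle vector, ||u|| = 1
    X = rng.uniform(-1, 1, size=(200, 2))
    X = X[np.abs(X @ u) > 0.1]                  # enforce a margin around the boundary u . x = 0
    y = np.where(X @ u > 0, 1, -1)

    R = np.max(np.linalg.norm(X, axis=1))       # radius R = max ||x||
    delta = np.min(y * (X @ u))                 # achieved margin delta

    w, mistakes, converged = np.zeros(2), 0, False
    while not converged:                        # guaranteed to terminate: the data is separable
        converged = True
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:
                w += y_i * x_i
                mistakes += 1
                converged = False

    print(mistakes, "updates; bound R^2 / delta^2 =", (R / delta) ** 2)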

Part V

• Limitations of Linear Classifiers and Feature Maps
• XOR: not linearly separable
• perceptron cycling theorem
• solving XOR: non-linear feature map
• “preview demo”: SVM with non-linear kernel
• redefining “linear” separation under feature map

XOR

• XOR is not linearly separable
• non-linear separation is trivial
• caveat from “Perceptrons” (Minsky & Papert, 1969): finding the minimum-error linear separator is NP-hard (this killed neural networks in the 70s)

Brief History of Perceptron

[Same timeline figure as in the earlier “Brief History of Perceptron” slide.]

What if the data is not separable?

• in practice, data is almost always inseparable
• wait, what exactly does that mean?
• perceptron cycling theorem (1970): the weights will remain bounded and will not diverge
• use a dev set for early stopping (prevents overfitting; see the sketch after this list)
• non-linearity (inseparable in low dimensions => separable in high dimensions)
  • higher-order features by combining atomic ones (cf. XOR)
  • a more systematic way: kernels (more details in week 5)
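For the early-stopping bullet above, here is one way it could look in code; this is an illustrative sketch under the assumption that we evaluate on the dev set after every epoch and keep the best weights (it is not the course implementation):

    import numpy as np

    def perceptron_early_stop(train, dev, max_epochs=20):
        # train/dev: lists of (x, y) pairs; returns the weights with the best dev accuracy
        w = np.zeros(len(train[0][0]))
        best_w, best_acc = w.copy(), 0.0
        for _ in range(max_epochs):
            for x, y in train:
                if y * np.dot(w, x) <= 0:
                    w += y * x
            acc = np.mean([y * np.dot(w, x) > 0 for x, y in dev])
            if acc > best_acc:                 # keep the epoch that generalizes best
                best_acc, best_w = acc, w.copy()
        return best_w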

Solving XOR: Non-Linear Feature Map

ϕ: (x1, x2) → (x1, x2, x1 x2)

• XOR is not linearly separable
• mapping into 3D makes it easily linearly separable
• this mapping is actually non-linear (quadratic feature x1 x2)
• a special case of “… kernels” (week 5)
• a linear decision boundary in 3D => non-linear boundaries in 2D
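A tiny sketch of this feature map: with the four XOR points encoded as +/-1 (an encoding chosen here for convenience, not taken from the slides), a weight on the product feature alone separates the mapped data:

    import numpy as np

    def phi(x):
        # quadratic feature map: (x1, x2) -> (x1, x2, x1*x2)
        return np.array([x[0], x[1], x[0] * x[1]])

    # XOR with +/-1 encoding: the label is +1 exactly when the two inputs differ
    points = [np.array([1., 1.]), np.array([1., -1.]), np.array([-1., 1.]), np.array([-1., -1.])]
    labels = [-1, +1, +1, -1]

    w = np.array([0., 0., -1.])   # weight only on the product feature x1*x2
    assert all(y * np.dot(w, phi(x)) > 0 for x, y in zip(points, labels))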

Low-dimension <=> High-dimension

[Figures: the dataset is not linearly separable in 2D; after the feature map it is linearly separable in 3D; the linear decision boundary in 3D corresponds to non-linear boundaries back in 2D.]

Linear Separation under Feature Map

• we have to redefine separation and the convergence theorem
• dataset D is said to be linearly separable under a feature map ϕ if there exists some unit oracle vector u (||u|| = 1) which correctly classifies every example (x, y) with a margin of at least δ:

      y(u · ϕ(x)) ≥ δ  for all (x, y) ∈ D

• then the perceptron must converge to a linear separator after at most R²/δ² mistakes (updates), where R = max over (x, y) ∈ D of ||ϕ(x)||
• in practice, the choice of feature map (“feature engineering”) is often more important than the choice of learning algorithm
• the first step of any machine learning project is data preprocessing: transform each (x, y) to (ϕ(x), y)
• at testing time, also transform each x to ϕ(x) (see the sketch after this list)

• deep learning aims to automate feature engineering
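Putting the preprocessing advice in one place: apply ϕ to every training example before learning, and apply the same ϕ at test time. A self-contained sketch (the particular feature map and the fixed epoch count are assumptions for illustration):

    import numpy as np

    def phi(x):
        # assumed feature map: a constant bias feature plus the quadratic feature x1*x2
        return np.array([1.0, x[0], x[1], x[0] * x[1]])

    def perceptron_train(data, epochs=50):
        w = np.zeros(len(data[0][0]))
        for _ in range(epochs):
            for x, y in data:
                if y * np.dot(w, x) <= 0:
                    w += y * x
        return w

    # training: transform each (x, y) to (phi(x), y), then learn w in the mapped space
    raw = [(np.array([1., 1.]), -1), (np.array([1., -1.]), +1),
           (np.array([-1., 1.]), +1), (np.array([-1., -1.]), -1)]   # XOR-style toy data
    w = perceptron_train([(phi(x), y) for x, y in raw])

    # testing: transform the new x with the same phi before scoring
    x_new = np.array([0.5, -2.0])
    print(1 if np.dot(w, phi(x_new)) > 0 else -1)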