CS 4900/5900:

The Perceptron Algorithm and the Kernel Trick

Razvan C. Bunescu School of Electrical Engineering and Computer Science [email protected]

1 McCulloch-Pitts Neuron Function

[Figure: McCulloch-Pitts neuron. Inputs x0 = 1, x1, x2, x3 are weighted by w0, w1, w2, w3, summed (Σ wi xi), and passed through an activation/output function f to produce h(x).]

• Algebraic interpretation:
– The output of the neuron is a linear combination of inputs from other neurons, rescaled by the synaptic weights.
• The weights wi correspond to the synaptic weights (activating or inhibiting).
• The summation corresponds to the combination of signals in the soma.
– The result is often transformed through a monotonic activation function.

2 Activation/Output Functions

1 $" 0 if z < 0 unit step f (z) = # %$ 1 if z ≥ 0 Perceptron

logistic 1 f (z) = −z 1+ e identity f (z) = z 0
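As a quick reference, a minimal NumPy sketch of the three activation functions above; the function names are my own.

```python
import numpy as np

def unit_step(z):
    # f(z) = 0 if z < 0, 1 if z >= 0  (the Perceptron's activation)
    return np.where(z < 0, 0.0, 1.0)

def logistic(z):
    # f(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def identity(z):
    # f(z) = z
    return z

print(unit_step(np.array([-2.0, 0.0, 3.0])))  # [0. 1. 1.]
print(logistic(np.array([0.0])))              # [0.5]
```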

3 Perceptron

[Figure: Perceptron. Inputs x0 = 1, x1, x2, x3 with weights w0, ..., w3 are summed (Σ wi xi) and passed through the step activation function f.]

h_w(x) = 1 if w^T x > 0, 0 otherwise,   where f(z) = 0 if z < 0, 1 if z ≥ 0

• Assume classes T = {c1, c2} = {1, -1}.

• Training set is (x1, t1), (x2, t2), …, (xn, tn).
• x = [1, x1, x2, ..., xk]^T,   h(x) = step(w^T x)

4 Perceptron Learning

• Learning = finding the “right” parameters w = [w0, w1, …, wk]^T
– Find w that minimizes an error function E(w) which measures the misfit between h(xn, w) and tn.
– Expect that h(x, w) performing well on training examples xn ⇒ h(x, w) will perform well on arbitrary test examples x ∈ X.

• Least Squares error function?

E(w) = (1/2) Σ_{n=1}^{N} {h(xn, w) − tn}²    ← # of mistakes

5 Least Squares vs. Perceptron Criterion

• Least Squares => cost is # of misclassified patterns: – Piecewise constant function of w with discontinuities. – Cannot find closed form solution for w that minimizes cost. – Cannot use gradient methods (gradient zero almost everywhere).

• Perceptron Criterion:
– Set labels to be +1 and −1. Want w^T xn > 0 for tn = 1, and w^T xn < 0 for tn = −1.
⇒ would like to have w^T xn tn > 0 for all patterns.
⇒ want to minimize −w^T xn tn for all misclassified patterns M.
⇒ minimize E_P(w) = − Σ_{n∈M} w^T xn tn
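A minimal sketch of the criterion above, assuming the inputs already carry the bias feature x0 = 1 and the labels are in {+1, −1}; the function name is my own.

```python
import numpy as np

def perceptron_criterion(w, X, t):
    # E_P(w) = - sum over misclassified patterns M of w^T x_n t_n
    # X: (N, k+1) array with a leading 1 per row (bias); t: (N,) labels in {+1, -1}
    margins = (X @ w) * t          # w^T x_n t_n for every pattern
    mistakes = margins <= 0        # the misclassified set M
    return -np.sum(margins[mistakes])
```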

6 Stochastic Gradient Descent

• Perceptron Criterion: minimize E_P(w) = − Σ_{n∈M} w^T xn tn

• Update parameters w sequentially after each mistake:
w^(τ+1) = w^(τ) − η ∇E_P(w^(τ), xn) = w^(τ) + η xn tn

• The magnitude of w is inconsequential ⇒ set η = 1.
w^(τ+1) = w^(τ) + xn tn

7 The Perceptron Algorithm: Two Classes

1. initialize parameters w = 0
2. for n = 1 … N
3.    hn = sgn(w^T xn)
4.    if hn ≠ tn then
5.       w = w + tn xn
Repeat steps 2–5: (a) until convergence, or (b) for a number of epochs E.

Theorem [Rosenblatt, 1962]: If the training dataset is linearly separable, the perceptron learning algorithm is guaranteed to find a solution in a finite number of steps. • see Theorem 1 (Block, Novikoff) in [Freund & Schapire, 1999].
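A minimal NumPy sketch of steps 1–5 above, run for a fixed number of epochs E; it assumes the inputs already include the constant bias feature x0 = 1 and that labels are in {+1, −1}.

```python
import numpy as np

def perceptron_train(X, t, epochs=10):
    # X: (N, k+1) inputs with a leading bias feature; t: (N,) labels in {+1, -1}
    w = np.zeros(X.shape[1])                 # 1. initialize w = 0
    for _ in range(epochs):                  # repeat for a number of epochs E
        for n in range(X.shape[0]):          # 2. for n = 1 ... N
            h_n = np.sign(w @ X[n])          # 3. h_n = sgn(w^T x_n)
            if h_n != t[n]:                  # 4. if h_n != t_n then
                w = w + t[n] * X[n]          # 5. w = w + t_n x_n
    return w
```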

8 Averaged Perceptron: Two Classes

1. initialize parameters w = 0, τ = 1, w̄ = 0
2. for n = 1 … N
3.    hn = sgn(w^T xn)
4.    if hn ≠ tn then
5.       w = w + tn xn
6.    w̄ = w̄ + w
7.    τ = τ + 1
8. return w̄ / τ
Repeat steps 2–7: (a) until convergence, or (b) for a number of epochs E.

During testing: h(x) = sgn(w̄^T x)
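The averaged variant as a sketch, under the same assumptions as the previous snippet (bias feature included, labels in {+1, −1}); w̄ from the pseudocode is kept as a running sum.

```python
import numpy as np

def averaged_perceptron_train(X, t, epochs=10):
    w = np.zeros(X.shape[1])                 # 1. w = 0
    w_sum = np.zeros(X.shape[1])             # running sum of w (the averaged weights)
    tau = 1                                  # update counter
    for _ in range(epochs):                  # repeat for a number of epochs E
        for n in range(X.shape[0]):          # 2. for n = 1 ... N
            if np.sign(w @ X[n]) != t[n]:    # 3-4. mistake?
                w = w + t[n] * X[n]          # 5. w = w + t_n x_n
            w_sum = w_sum + w                # 6. accumulate w
            tau += 1                         # 7. tau = tau + 1
    return w_sum / tau                       # 8. return the averaged weights

# During testing: h(x) = np.sign(w_avg @ x)
```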

9 Linear vs. Non-linear Decision Boundaries

[Figure: the Boolean functions And, Or, and Xor in the (x1, x2) plane — which of them can a linear decision boundary separate?]

φ(x) = [1, x1, x2]^T,   w = [w0, w1, w2]^T
⇒ w^T φ(x) = [w1, w2][x1, x2]^T + w0

10 How to Find Non-linear Decision Boundaries

1) Logistic Regression with manually engineered features: – Quadratic features.

2) Kernel methods (e.g. SVMs) with non-linear kernels: – Quadratic kernels, Gaussian kernels.
3) Unsupervised feature learning (e.g. auto-encoders): – Plug the learned features into any classifier.

4) Neural Networks with one or more hidden layers: – Automatically learned features.

11 Non-Linear Classification: XOR Dataset

x = [x1, x2]

12 1) Manually Engineered Features: Add x1x2

x = [x1, x2, x1x2]

13 Logistic Regression with Manually Engineered Features

x = [x1, x2, x1x2]

14 Perceptron with Manually Engineered Features

Project x = [x1, x2, x1x2] and decision hyperplane back to x = [x1, x2]
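A small sketch, reusing the perceptron_train helper shown earlier, of how the extra product feature x1·x2 makes the XOR data linearly separable; the ±1 input encoding is my choice.

```python
import numpy as np

# XOR with inputs encoded as +/-1; target is +1 when x1 != x2, else -1
X_raw = np.array([[-1., -1.], [-1., 1.], [1., -1.], [1., 1.]])
t = np.array([-1., 1., 1., -1.])

# Engineered representation: [1 (bias), x1, x2, x1*x2]
X = np.column_stack([np.ones(4), X_raw[:, 0], X_raw[:, 1], X_raw[:, 0] * X_raw[:, 1]])

w = perceptron_train(X, t, epochs=10)   # helper sketched on an earlier slide
print(np.sign(X @ w))                   # matches t: XOR is separable in this space
```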

15 2) Kernel Methods with Non-Linear Kernels

• Linear classifiers such as the Perceptron and SVMs can be ‘kernelized’:
1. Re-write the algorithm such that, during training and testing, feature vectors x, y appear only in dot-products x^T y.
2. Replace dot-products x^T y with non-linear kernels K(x, y):
   • K is a kernel if and only if ∃φ such that K(x, y) = φ(x)^T φ(y)
     – φ can map into a much higher dimensional space,
       » e.g. combinations of up to k original features.
     – φ(x)^T φ(y) can be computed efficiently, without enumerating φ(x) or φ(y).

16 The Perceptron Algorithm: Two Classes

1. initialize parameters w = 0
2. for n = 1 … N
3.    hn = sgn(w^T xn)
4.    if hn ≠ tn then
5.       w = w + tn xn
Repeat steps 2–5: (a) until convergence, or (b) for a number of epochs E.

Loop invariant: w is a weighted sum of training vectors:

w = Σ_n αn tn xn   ⇒   w^T x = Σ_n αn tn xn^T x

17 Kernel Perceptron: Two Classes

1. define f(x) = w^T x = Σ_n αn tn xn^T x = Σ_n αn tn K(xn, x)
2. initialize dual parameters αn = 0
3. for n = 1 … N
4.    hn = sgn f(xn)
5.    if hn ≠ tn then
6.       αn = αn + 1

During testing: h(x) = sgn f(x)
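A minimal sketch of the dual (kernel) perceptron above, in the αn = αn + 1 form; the kernel is passed in as a plain Python function and the helper names are my own.

```python
import numpy as np

def kernel_perceptron_train(X, t, kernel, epochs=10):
    # Dual perceptron: f(x) = sum_n alpha_n t_n K(x_n, x)
    N = X.shape[0]
    alpha = np.zeros(N)                               # 2. dual parameters alpha_n = 0
    K = np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])  # Gram matrix
    for _ in range(epochs):                           # repeat for a number of epochs E
        for n in range(N):                            # 3. for n = 1 ... N
            f_n = np.sum(alpha * t * K[:, n])         # 4. f(x_n)
            if np.sign(f_n) != t[n]:                  # 5. if h_n != t_n then
                alpha[n] += 1                         # 6. alpha_n = alpha_n + 1
    return alpha

def kernel_perceptron_predict(x, X, t, alpha, kernel):
    # During testing: h(x) = sgn f(x)
    return np.sign(sum(alpha[n] * t[n] * kernel(X[n], x) for n in range(len(t))))
```

Passing, for example, kernel=lambda x, y: (x @ y) ** 2 (the quadratic kernel of the later slides) yields a non-linear classifier without ever enumerating φ(x).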

18 Kernel Perceptron: Two Classes

1. define f(x) = w^T x = Σ_n αn xn^T x = Σ_n αn K(xn, x)
2. initialize dual parameters αn = 0
3. for n = 1 … N
4.    hn = sgn f(xn)
5.    if hn ≠ tn then
6.       αn = αn + tn

During testing: h(x) = sgn f(x)

19 The Perceptron vs. Boolean Functions

[Figure: the Boolean functions And, Or, and Xor in the (x1, x2) plane — which of them can a linear decision boundary separate?]

φ(x) = [1, x1, x2]^T,   w = [w0, w1, w2]^T
⇒ w^T φ(x) = [w1, w2][x1, x2]^T + w0

20 Perceptron with Quadratic Kernel

• Discriminant function:

f(x) = Σ_i αi ti φ(xi)^T φ(x) = Σ_i αi ti K(xi, x)

• Quadratic kernel:
K(x, y) = (x^T y)^2 = (x1 y1 + x2 y2)^2

⇒ corresponding feature space φ(x) = ?

conjunctions of two atomic features
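A quick numerical check (my own sketch) that for two features the quadratic kernel equals a dot product of explicit feature maps φ(x) = [x1², √2·x1·x2, x2²], i.e. products of pairs of original features.

```python
import numpy as np

def phi_quadratic(x):
    # Explicit feature map for K(x, y) = (x^T y)^2 with x = [x1, x2]
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print((x @ y) ** 2)                          # kernel value: (3 - 2)^2 = 1
print(phi_quadratic(x) @ phi_quadratic(y))   # same value via explicit features: 1.0
```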

21 Perceptron with Quadratic Kernel

[Figure: points a, b, c, d in the input space x (linear kernel K(x, y) = x^T y) and their images in the feature space φ(x) (quadratic kernel K(x, y) = (x^T y)^2).]

22 Quadratic Kernels

• Circles, hyperbolas, and ellipses as separating surfaces:
K(x, y) = (1 + x^T y)^2 = φ(x)^T φ(y)
φ(x) = [1, √2·x1, √2·x2, x1², √2·x1·x2, x2²]^T

[Figure: a non-linear separating surface in the (x1, x2) plane.]

23 Quadratic Kernels

K(x, y) = (x^T y)^2 = φ(x)^T φ(y)

[Figure: mapping from the input space x to the feature space φ(x).]

24 Explicit Features vs. Kernels

• Explicitly enumerating features can be prohibitive:
– 1,000 basic features for x^T y ⇒ 500,500 quadratic features for (x^T y)^2.
– Much worse for higher order features.

• Solution:
– Do not compute the feature vectors, compute kernels instead (i.e. compute dot products between implicit feature vectors).
• (x^T y)^2 takes 1,001 multiplications.
• φ(x)^T φ(y) in the explicit feature space takes 500,500 multiplications.

25 Kernel Functions

• Definition: A function k : X × X → R is a kernel function if there exists a feature mapping φ : X → R^n such that:
k(x, y) = φ(x)^T φ(y)

• Theorem: k : X × X → R is a valid kernel ⇔ the Gram matrix K whose elements are given by k(xn, xm) is positive semidefinite for all possible choices of the set {xn}.
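A small sketch of how the theorem is used in practice: build the Gram matrix of a candidate kernel on an arbitrary sample of points and check that its eigenvalues are (numerically) non-negative. The Gaussian kernel and the random data here are my own example.

```python
import numpy as np

def gram_matrix(kernel, points):
    # K[n, m] = k(x_n, x_m)
    return np.array([[kernel(a, b) for b in points] for a in points])

rng = np.random.default_rng(0)
pts = rng.normal(size=(20, 3))                      # arbitrary sample {x_n}

gaussian = lambda x, y: np.exp(-np.sum((x - y) ** 2) / 2.0)
K = gram_matrix(gaussian, pts)
print(np.min(np.linalg.eigvalsh(K)) >= -1e-10)      # True: the Gram matrix is PSD
```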

26 Kernel Examples

• Linear kernel: K(x, y) = x^T y

• Quadratic kernel: K(x, y) = (c + x^T y)^2
– contains constant, linear terms and terms of order two (c > 0).

• Polynomial kernel: K(x, y) = (c + x^T y)^M
– contains all terms up to degree M (c > 0).

• Gaussian kernel: K(x, y) = exp(−‖x − y‖² / 2σ²)
– corresponding feature space has infinite dimensionality.
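The four kernels above, written out directly as a sketch (parameter names c, M, and sigma follow the slide).

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y                                              # x^T y

def quadratic_kernel(x, y, c=1.0):
    return (c + x @ y) ** 2                                   # (c + x^T y)^2

def polynomial_kernel(x, y, c=1.0, M=3):
    return (c + x @ y) ** M                                   # (c + x^T y)^M

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))   # exp(-||x - y||^2 / 2 sigma^2)
```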

27 Techniques for Constructing Kernels

28 Kernels over Discrete Structures

• Subsequence Kernels [Lodhi et al., JMLR 2002]:
– Σ is a finite alphabet (set of symbols).
– x, y ∈ Σ* are two sequences of symbols with lengths |x| and |y|.
– k(x, y) is defined as the number of common substrings of length n.
– k(x, y) can be computed in O(n|x||y|) time complexity (a simplified sketch follows after this list).

• Tree Kernels [Collins and Duffy, NIPS 2001]:

– T1 and T2 are two trees with N1 and N2 nodes respectively.

– k(T1, T2) is defined as the number of common subtrees.

– k(T1, T2) can be computed in O(N1N2) time complexity. – in practice, time is linear in the size of the trees.
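A simplified sketch in the spirit of the first bullet above: it counts, with multiplicity, the matching pairs of length-n substrings of x and y (an n-gram count kernel), which the naive double loop below computes in O(n|x||y|) time. This is only an illustration, not the gap-weighted subsequence kernel of Lodhi et al. or the tree kernel of Collins and Duffy.

```python
def substring_count_kernel(x, y, n=3):
    # Counts matching pairs of length-n substrings of x and y (with multiplicity),
    # i.e. the dot product of their n-gram count vectors, so it is a valid kernel.
    count = 0
    for i in range(len(x) - n + 1):
        for j in range(len(y) - n + 1):
            if x[i:i + n] == y[j:j + n]:     # O(n) comparison
                count += 1
    return count

print(substring_count_kernel("science", "conscience", n=4))  # 4 shared 4-grams
```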

29 Supplementary Reading

• PRML Chapter 6: – Section 6.1 on dual representations for linear regression models. – Section 6.2 on techniques for constructing new kernels.

31 The Perceptron Algorithm: K classes

1. initialize parameters w = 0
2. for i = 1 … n
3.    yi = argmax_{t∈T} w^T φ(xi, t)
4.    if yi ≠ ti then
5.       w = w + φ(xi, ti) − φ(xi, yi)
Repeat steps 2–5: (a) until convergence, or (b) for a number of epochs E.

During testing: t* = argmax_{t∈T} w^T φ(x, t)
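A minimal sketch of the K-class rule above. The slides do not fix a particular joint feature map φ(x, t); the per-class block construction below is one common choice of my own, and labels are assumed to be integers 0 … K−1.

```python
import numpy as np

def phi(x, c, num_classes):
    # Joint feature map: place x in the block that belongs to class c
    out = np.zeros(num_classes * x.shape[0])
    out[c * x.shape[0]:(c + 1) * x.shape[0]] = x
    return out

def multiclass_perceptron_train(X, t, num_classes, epochs=10):
    w = np.zeros(num_classes * X.shape[1])                    # 1. initialize w = 0
    for _ in range(epochs):                                   # repeat for E epochs
        for i in range(X.shape[0]):                           # 2. for i = 1 ... n
            scores = [w @ phi(X[i], c, num_classes) for c in range(num_classes)]
            y_i = int(np.argmax(scores))                      # 3. y_i = argmax_t w^T phi(x_i, t)
            if y_i != t[i]:                                   # 4. if y_i != t_i then
                w += phi(X[i], t[i], num_classes) - phi(X[i], y_i, num_classes)  # 5.
    return w

# During testing: t* = argmax over classes c of w @ phi(x, c, num_classes)
```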

32 Averaged Perceptron: K classes

1. initialize parameters w = 0, τ = 1, w̄ = 0
2. for i = 1 … n
3.    yi = argmax_{t∈T} w^T φ(xi, t)
4.    if yi ≠ ti then
5.       w = w + φ(xi, ti) − φ(xi, yi)
6.    w̄ = w̄ + w
7.    τ = τ + 1
8. return w̄ / τ
Repeat steps 2–7: (a) until convergence, or (b) for a number of epochs E.

During testing: t* = argmax_{t∈T} w̄^T φ(x, t)

33 The Perceptron Algorithm: K classes

1. initialize parameters w = 0
2. for i = 1 … n
3.    cj = argmax_{t∈T} w^T φ(xi, t)
4.    if cj ≠ ti then
5.       w = w + φ(xi, ti) − φ(xi, cj)
Repeat steps 2–5: (a) until convergence, or (b) for a number of epochs E.

Loop invariant: w is a weighted sum of training vectors:

w = Σ_{i,j} αij (φ(xi, ti) − φ(xi, cj))

⇒ w^T φ(x, t) = Σ_{i,j} αij (φ(xi, ti)^T φ(x, t) − φ(xi, cj)^T φ(x, t))

34 Kernel Perceptron: K classes

1. define f(x, t) = Σ_{i,j} αij (φ(xi, ti)^T φ(x, t) − φ(xi, cj)^T φ(x, t))
2. initialize dual parameters αij = 0
3. for i = 1 … n
4.    cj = argmax_{t∈T} f(xi, t)
5.    if cj ≠ ti then
6.       αij = αij + 1
Repeat steps 3–6: (a) until convergence, or (b) for a number of epochs E.

During testing: t* = arg max f (x,t) tÎT

35 Kernel Perceptron: K classes

• Discriminant function:
f(x, t) = Σ_{i,j} αij (φ(xi, ti)^T φ(x, t) − φ(xi, cj)^T φ(x, t))
        = Σ_{i,j} αij (K(xi, ti, x, t) − K(xi, cj, x, t))
where:
K(xi, ti, x, t) = φ(xi, ti)^T φ(x, t)
K(xi, cj, x, t) = φ(xi, cj)^T φ(x, t)
