
CS 4900/5900: Machine Learning
The Perceptron Algorithm
The Kernel Trick
Razvan C. Bunescu
School of Electrical Engineering and Computer Science
[email protected]

McCulloch-Pitts Neuron Function

[Figure: a neuron with inputs x_0 = 1, x_1, x_2, x_3, synaptic weights w_0, ..., w_3, a summation unit Σ w_i x_i, an activation/output function f, and output h_w(x).]

• Algebraic interpretation:
  – The output of the neuron is a linear combination of the inputs from other neurons, rescaled by the synaptic weights.
    • The weights w_i correspond to the synaptic weights (activating or inhibiting).
    • The summation corresponds to the combination of signals in the soma.
  – The combination is often transformed through a monotonic activation function.

Activation/Output Functions

• Unit step: f(z) = 0 if z < 0, 1 if z ≥ 0   → Perceptron
• Logistic: f(z) = 1 / (1 + e^(−z))          → Logistic Regression
• Identity: f(z) = z                          → Linear Regression

Perceptron

[Figure: the same neuron architecture, with the unit step f(z) = 0 if z < 0, 1 if z ≥ 0 as activation function.]

      h_w(x) = 1 if wᵀx > 0, and 0 otherwise.

• Assume classes T = {c_1, c_2} = {1, −1}.
• The training set is (x_1, t_1), (x_2, t_2), ..., (x_n, t_n).
• With x = [1, x_1, x_2, ..., x_k]ᵀ, the model is h(x) = step(wᵀx).

Perceptron Learning

• Learning = finding the "right" parameters w = [w_0, w_1, ..., w_k]ᵀ:
  – Find w that minimizes an error function E(w) which measures the misfit between h(x_n, w) and t_n.
  – Expect that h(x, w) performing well on the training examples x_n ⇒ h(x, w) will perform well on arbitrary test examples x ∈ X.
• Least squares error function?
      E(w) = ½ Σ_{n=1..N} {h(x_n, w) − t_n}²   (the # of mistakes)

Least Squares vs. Perceptron Criterion

• Least squares ⇒ the cost is the # of misclassified patterns:
  – A piecewise constant function of w, with discontinuities.
  – Cannot find a closed-form solution for the w that minimizes the cost.
  – Cannot use gradient methods (the gradient is zero almost everywhere).
• Perceptron criterion:
  – Set the labels to be +1 and −1. Want wᵀx_n > 0 for t_n = 1, and wᵀx_n < 0 for t_n = −1.
  ⇒ would like to have wᵀx_n t_n > 0 for all patterns
  ⇒ want to minimize −wᵀx_n t_n over all misclassified patterns M
  ⇒ minimize E_P(w) = −Σ_{n∈M} wᵀx_n t_n

Stochastic Gradient Descent

• Perceptron criterion: minimize E_P(w) = −Σ_{n∈M} wᵀx_n t_n
• Update the parameters w sequentially, after each mistake:
      w^(τ+1) = w^(τ) − η ∇E_P(w^(τ), x_n) = w^(τ) + η x_n t_n
• The magnitude of w is inconsequential ⇒ set η = 1:
      w^(τ+1) = w^(τ) + x_n t_n

The Perceptron Algorithm: Two Classes

  1. initialize parameters w = 0
  2. for n = 1 ... N
  3.   h_n = sgn(wᵀx_n)
  4.   if h_n ≠ t_n then
  5.     w = w + t_n x_n
  Repeat: (a) until convergence, or (b) for a number of epochs E.

Theorem [Rosenblatt, 1962]: If the training dataset is linearly separable, the perceptron learning algorithm is guaranteed to find a solution in a finite number of steps.
• See Theorem 1 (Block, Novikoff) in [Freund & Schapire, 1999].

Averaged Perceptron: Two Classes

  1. initialize parameters w = 0, τ = 1, w̄ = 0
  2. for n = 1 ... N
  3.   h_n = sgn(wᵀx_n)
  4.   if h_n ≠ t_n then
  5.     w = w + t_n x_n
  6.   w̄ = w̄ + w
  7.   τ = τ + 1
  8. return w̄ / τ
  Repeat: (a) until convergence, or (b) for a number of epochs E.

During testing: h(x) = sgn(w̄ᵀx).
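The two-class pseudocode above translates almost line for line into NumPy. Below is a minimal sketch, not taken from the slides: it assumes labels in {+1, −1} and inputs already augmented with the bias feature x_0 = 1; the function name `perceptron_train` and the toy AND-style data are illustrative only.

```python
import numpy as np

def perceptron_train(X, t, epochs=10):
    """Two-class (averaged) perceptron with labels in {+1, -1}.
    X is an (N, k+1) array whose first column is the bias feature x_0 = 1.
    Returns the final weights w and the averaged weights w_bar."""
    N, d = X.shape
    w = np.zeros(d)          # current weight vector
    w_bar = np.zeros(d)      # running sum of weight vectors
    tau = 1                  # update counter for the average
    for _ in range(epochs):
        for n in range(N):
            h = np.sign(w @ X[n])     # sgn(w^T x_n); sign(0) = 0 also counts as a mistake
            if h != t[n]:             # mistake-driven update: w = w + t_n x_n
                w = w + t[n] * X[n]
            w_bar = w_bar + w         # accumulate after every example
            tau += 1
    return w, w_bar / tau

# Toy usage on an AND-style dataset (bias feature prepended).
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
t = np.array([-1, -1, -1, 1])
w, w_bar = perceptron_train(X, t, epochs=20)
print(np.sign(X @ w_bar))             # h(x) = sgn(w_bar^T x) for each training example
```

Returning the average of all intermediate weight vectors, rather than the last one, is what makes this the averaged perceptron: it reduces sensitivity to the order in which training examples are visited.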
Linear vs. Non-linear Decision Boundaries

[Figure: the AND, OR, and XOR datasets in the (x_1, x_2) plane — which of them can a linear decision boundary separate?]

      φ(x) = [1, x_1, x_2]ᵀ,  w = [w_0, w_1, w_2]ᵀ  ⇒  wᵀφ(x) = [w_1, w_2]ᵀ[x_1, x_2] + w_0

How to Find Non-linear Decision Boundaries

1) Logistic regression with manually engineered features:
   – quadratic features.
2) Kernel methods (e.g. SVMs) with non-linear kernels:
   – quadratic kernels, Gaussian kernels.
3) Unsupervised feature learning (e.g. auto-encoders):
   – plug the learned features into any linear classifier.
4) Neural networks with one or more hidden layers:
   – automatically learned features.
   (3 and 4: covered in the Deep Learning class.)

Non-Linear Classification: XOR Dataset

[Figure: the XOR dataset plotted with features x = [x_1, x_2].]

1) Manually Engineered Features: Add x_1·x_2

[Figure: the XOR dataset plotted in the augmented feature space x = [x_1, x_2, x_1 x_2].]

Logistic Regression with Manually Engineered Features

[Figure: the logistic regression decision boundary learned on x = [x_1, x_2, x_1 x_2].]

Perceptron with Manually Engineered Features

[Figure: the perceptron trained on x = [x_1, x_2, x_1 x_2], with the decision hyperplane projected back onto x = [x_1, x_2].]

2) Kernel Methods with Non-Linear Kernels

• Perceptrons and SVMs can be 'kernelized':
  1. Re-write the algorithm such that, during training and testing, the feature vectors x, y appear only inside dot products xᵀy.
  2. Replace the dot products xᵀy with non-linear kernels K(x, y):
     • K is a kernel if and only if there exists a φ such that K(x, y) = φ(x)ᵀφ(y).
       – φ can live in a much higher-dimensional space,
         » e.g. combinations of up to k original features.
       – φ(x)ᵀφ(y) can be computed efficiently, without enumerating φ(x) or φ(y).

The Perceptron Algorithm: Two Classes

  1. initialize parameters w = 0
  2. for n = 1 ... N
  3.   h_n = sgn(wᵀx_n)
  4.   if h_n ≠ t_n then
  5.     w = w + t_n x_n
  Repeat: (a) until convergence, or (b) for a number of epochs E.

Loop invariant: w is a weighted sum of training vectors:
      w = Σ_n α_n t_n x_n   ⇒   wᵀx = Σ_n α_n t_n x_nᵀx

Kernel Perceptron: Two Classes

  1. define f(x) = wᵀx = Σ_n α_n t_n x_nᵀx = Σ_n α_n t_n K(x_n, x)
  2. initialize dual parameters α_n = 0
  3. for n = 1 ... N
  4.   h_n = sgn f(x_n)
  5.   if h_n ≠ t_n then
  6.     α_n = α_n + 1
  During testing: h(x) = sgn f(x).

Kernel Perceptron: Two Classes (equivalent formulation, with the labels folded into the dual parameters)

  1. define f(x) = wᵀx = Σ_n α_n x_nᵀx = Σ_n α_n K(x_n, x)
  2. initialize dual parameters α_n = 0
  3. for n = 1 ... N
  4.   h_n = sgn f(x_n)
  5.   if h_n ≠ t_n then
  6.     α_n = α_n + t_n
  During testing: h(x) = sgn f(x).

The Perceptron vs. Boolean Functions

[Figure: the AND, OR, and XOR datasets again — which of them can the perceptron represent?]

      φ(x) = [1, x_1, x_2]ᵀ,  w = [w_0, w_1, w_2]ᵀ  ⇒  wᵀφ(x) = [w_1, w_2]ᵀ[x_1, x_2] + w_0

Perceptron with Quadratic Kernel

• Discriminant function:
      f(x) = Σ_i α_i t_i φ(x_i)ᵀφ(x) = Σ_i α_i t_i K(x_i, x)
• Quadratic kernel:
      K(x, y) = (xᵀy)² = (x_1 y_1 + x_2 y_2)²
  ⇒ What is the corresponding feature space φ(x)? Conjunctions of two atomic features.

Perceptron with Quadratic Kernel

[Figure: four points a, b, c, d shown both in the original space x and in the feature space φ(x); linear kernel K(x, y) = xᵀy vs. quadratic kernel K(x, y) = (xᵀy)².]

Quadratic Kernels

• Circles, hyperbolas, and ellipses as separating surfaces:
      K(x, y) = (1 + xᵀy)² = φ(x)ᵀφ(y)
      φ(x) = [1, √2·x_1, √2·x_2, x_1², √2·x_1 x_2, x_2²]ᵀ

Quadratic Kernels

      K(x, y) = (xᵀy)² = φ(x)ᵀφ(y)

[Figure: a dataset in the original space x and its image in the feature space φ(x).]

Explicit Features vs. Kernels

• Explicitly enumerating features can be prohibitive:
  – 1,000 basic features for xᵀy ⇒ 500,500 quadratic features for (xᵀy)².
  – Much worse for higher-order features.
• Solution:
  – Do not compute the feature vectors; compute kernels instead (i.e. compute dot products between the implicit feature vectors).
    • (xᵀy)² takes 1,001 multiplications.
    • φ(x)ᵀφ(y) in feature space takes 500,500 multiplications.

Kernel Functions

• Definition: A function k : X × X → R is a kernel function if there exists a feature mapping φ : X → Rⁿ such that:
      k(x, y) = φ(x)ᵀφ(y)
• Theorem: k : X × X → R is a valid kernel ⇔ the Gram matrix K, whose elements are given by k(x_n, x_m), is positive semidefinite for all possible choices of the set {x_n}.

Kernel Examples

• Linear kernel: K(x, y) = xᵀy
• Quadratic kernel: K(x, y) = (c + xᵀy)²
  – contains a constant, linear terms, and terms of order two (c > 0).
• Polynomial kernel: K(x, y) = (c + xᵀy)^M
  – contains all terms up to degree M (c > 0).
• Gaussian kernel: K(x, y) = exp(−‖x − y‖² / 2σ²)
  – the corresponding feature space has infinite dimensionality.
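To make the dual formulation concrete, here is a minimal NumPy sketch (not from the slides) of the kernel perceptron above, using the quadratic kernel K(x, y) = (c + xᵀy)² and the second update rule α_n = α_n + t_n. The function names and the XOR-style toy data are illustrative; labels are assumed to be in {+1, −1}.

```python
import numpy as np

def quadratic_kernel(x, y, c=1.0):
    """K(x, y) = (c + x^T y)^2."""
    return (c + x @ y) ** 2

def kernel_perceptron_train(X, t, kernel, epochs=10):
    """Dual (kernel) perceptron, two classes, labels in {+1, -1}.
    Uses the update alpha_n = alpha_n + t_n, so f(x) = sum_n alpha_n K(x_n, x)."""
    N = X.shape[0]
    alpha = np.zeros(N)
    # Gram matrix K[m, n] = kernel(x_m, x_n): the only place the inputs are used.
    K = np.array([[kernel(X[m], X[n]) for n in range(N)] for m in range(N)])
    for _ in range(epochs):
        for n in range(N):
            f_n = np.sum(alpha * K[:, n])   # f(x_n) = sum_m alpha_m K(x_m, x_n)
            if np.sign(f_n) != t[n]:        # mistake-driven dual update
                alpha[n] += t[n]
    return alpha

def kernel_perceptron_predict(X_train, alpha, kernel, x):
    """h(x) = sgn f(x) with f(x) = sum_n alpha_n K(x_n, x)."""
    f = sum(a * kernel(x_n, x) for a, x_n in zip(alpha, X_train))
    return int(np.sign(f))

# XOR with {-1, +1} inputs: not linearly separable in the original space,
# but separable in the implicit quadratic feature space.
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
t = np.array([-1, 1, 1, -1])
alpha = kernel_perceptron_train(X, t, quadratic_kernel, epochs=20)
print([kernel_perceptron_predict(X, alpha, quadratic_kernel, x) for x in X])
```

Note that training and prediction only ever touch the inputs through kernel evaluations K(x_n, x); the quadratic feature vector φ(x) is never built, which is exactly the point of the kernel trick.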
Techniques for Constructing Kernels

[Figure: rules for building new kernels out of existing ones; see PRML Section 6.2.]

Kernels over Discrete Structures

• Subsequence kernels [Lodhi et al., JMLR 2002]:
  – Σ is a finite alphabet (a set of symbols).
  – x, y ∈ Σ* are two sequences of symbols, with lengths |x| and |y|.
  – k(x, y) is defined as the number of common subsequences of length n.
  – k(x, y) can be computed in O(n|x||y|) time.
• Tree kernels [Collins and Duffy, NIPS 2001]:
  – T_1 and T_2 are two trees with N_1 and N_2 nodes respectively.
  – k(T_1, T_2) is defined as the number of common subtrees.
  – k(T_1, T_2) can be computed in O(N_1 N_2) time.
  – In practice, the running time is linear in the size of the trees.

Supplementary Reading

• PRML, Chapter 6:
  – Section 6.1 on dual representations for linear regression models.
  – Section 6.2 on techniques for constructing new kernels.

The Perceptron Algorithm: K Classes

  1. initialize parameters w = 0
  2. for i = 1 ... n
  3.   y_i = argmax_{t∈T} wᵀφ(x_i, t)
  4.   if y_i ≠ t_i then
  5.     w = w + φ(x_i, t_i) − φ(x_i, y_i)
  Repeat: (a) until convergence, or (b) for a number of epochs E.

During testing: t* = argmax_{t∈T} wᵀφ(x, t).

Averaged Perceptron: K Classes

  1. initialize parameters w = 0, τ = 1, w̄ = 0
  2. for i = 1 ... n
  3.   y_i = argmax_{t∈T} wᵀφ(x_i, t)
  4.   if y_i ≠ t_i then
  5.     w = w + φ(x_i, t_i) − φ(x_i, y_i)
  6.   w̄ = w̄ + w
  7.   τ = τ + 1
  8. return w̄ / τ
  Repeat: (a) until convergence, or (b) for a number of epochs E.

During testing: t* = argmax_{t∈T} w̄ᵀφ(x, t).

The Perceptron Algorithm: K Classes

  1. initialize parameters w = 0
  2. for i = 1 ... n
  3.   c_j = argmax_{t∈T} wᵀφ(x_i, t)
  4.   if c_j ≠ t_i then
  5.     w = w + φ(x_i, t_i) − φ(x_i, c_j)
  Repeat: (a) until convergence, or (b) for a number of epochs E.

Loop invariant: w is a weighted sum of differences of training feature vectors:
      w = Σ_{i,j} α_ij (φ(x_i, t_i) − φ(x_i, c_j))
  ⇒  wᵀφ(x, t) = Σ_{i,j} α_ij (φ(x_i, t_i)ᵀφ(x, t) − φ(x_i, c_j)ᵀφ(x, t))

Kernel Perceptron: K Classes
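The loop invariant above shows that wᵀφ(x, t) can be written purely in terms of dot products between joint feature vectors, so the K-class perceptron can be kernelized just like the two-class one: replace each φ(·)ᵀφ(·) with a kernel over (x, t) pairs. The sketch below shows one way this could look in NumPy; it is a sketch under stated assumptions, not the slides' own pseudocode. In particular, the joint kernel K((x, t), (x′, t′)) = [t = t′]·k(x, x′), the function names, and the toy data are all illustrative assumptions.

```python
import numpy as np

def joint_kernel(k, x, t, x2, t2):
    """Joint kernel over (x, t) pairs: K((x, t), (x', t')) = [t == t'] * k(x, x').
    This particular factorization is an assumption, not something the slides specify."""
    return k(x, x2) if t == t2 else 0.0

def kernel_perceptron_k_classes(X, t, num_classes, k, epochs=10):
    """Dual K-class perceptron: alpha[i, c] counts how many times class c was
    wrongly predicted for example i, so that (by the loop invariant)
    w^T phi(x, y) = sum_{i,c} alpha[i,c] * (K((x_i,t_i),(x,y)) - K((x_i,c),(x,y)))."""
    N = X.shape[0]
    alpha = np.zeros((N, num_classes))

    def f(x, y):
        total = 0.0
        for i in range(N):
            for c in range(num_classes):
                if alpha[i, c] != 0.0:
                    total += alpha[i, c] * (joint_kernel(k, X[i], t[i], x, y)
                                            - joint_kernel(k, X[i], c, x, y))
        return total

    for _ in range(epochs):
        for i in range(N):
            y = max(range(num_classes), key=lambda c: f(X[i], c))
            if y != t[i]:
                alpha[i, y] += 1.0   # implicitly: w = w + phi(x_i, t_i) - phi(x_i, y)
    return alpha, f

# Toy 3-class data with a quadratic base kernel (both illustrative).
quad = lambda u, v: (1.0 + u @ v) ** 2
X = np.array([[0.0, 0.0], [0.1, 0.2], [3.0, 0.0], [3.2, 0.3], [0.0, 3.0], [0.2, 3.1]])
t = np.array([0, 0, 1, 1, 2, 2])
alpha, f = kernel_perceptron_k_classes(X, t, num_classes=3, k=quad, epochs=20)
print([max(range(3), key=lambda c: f(x, c)) for x in X])   # predicted classes
```

As in the two-class case, training only ever touches the data through kernel evaluations; the joint feature vectors φ(x, t) are never constructed explicitly.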