CS 4900/5900: Machine Learning
The Perceptron Algorithm
The Kernel Trick
Razvan C. Bunescu
School of Electrical Engineering and Computer Science
[email protected]
McCulloch-Pitts Neuron Function

[Diagram: inputs x0 = 1, x1, x2, x3 with synaptic weights w0, w1, w2, w3 feed the summation Σ wi xi, which is passed through an activation/output function f to produce hw(x).]
• Algebraic interpretation:
  – The output of the neuron is a linear combination of inputs from other neurons, rescaled by the synaptic weights.
    • weights wi correspond to the synaptic weights (activating or inhibiting).
    • summation corresponds to the combination of signals in the soma.
  – It is often transformed through a monotonic activation function.
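A minimal sketch of this computation in NumPy (the input and weight values below are made up for illustration, not taken from the slides):

```python
import numpy as np

def neuron_output(x, w, f):
    """McCulloch-Pitts style neuron: weighted sum of inputs, passed through f."""
    return f(np.dot(w, x))

# Illustrative values only; x0 = 1 plays the role of the bias input.
x = np.array([1.0, 0.5, -1.2, 2.0])   # x0, x1, x2, x3
w = np.array([0.1, 0.8, 0.3, -0.5])   # w0, w1, w2, w3
print(neuron_output(x, w, lambda z: 1.0 if z >= 0 else 0.0))   # unit-step activation -> 0.0
```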
Activation/Output Functions

• unit step:  f(z) = 0 if z < 0, 1 if z ≥ 0   (Perceptron)
• logistic:   f(z) = 1 / (1 + e−z)            (Logistic Regression)
• identity:   f(z) = z                        (Linear Regression)
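The same three functions, transcribed directly into NumPy:

```python
import numpy as np

def unit_step(z):                      # Perceptron
    return np.where(z < 0, 0.0, 1.0)   # 0 if z < 0, 1 if z >= 0

def logistic(z):                       # Logistic Regression
    return 1.0 / (1.0 + np.exp(-z))

def identity(z):                       # Linear Regression
    return z
```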
Perceptron

[Diagram: inputs x0 = 1, x1, x2, x3 with weights w0, ..., w3 feed Σ wi xi and the activation function f.]

hw(x) = 1 if wTx > 0, 0 otherwise
f(z) = 0 if z < 0, 1 if z ≥ 0
• Assume classes T = {c1, c2} = {1, -1}.
• Training set is (x1, t1), (x2, t2), …, (xn, tn).
  x = [1, x1, x2, ..., xk]T
  h(x) = step(wTx)
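Putting the pieces together, a sketch of the hypothesis h(x) = step(wTx), assuming x is stored with the constant bias feature x0 = 1 already prepended (the numbers are illustrative):

```python
import numpy as np

def h(x, w):
    """Perceptron hypothesis: 1 if w^T x > 0, 0 otherwise (x includes x0 = 1)."""
    return 1.0 if np.dot(w, x) > 0 else 0.0

x = np.concatenate(([1.0], [0.4, -1.0]))   # [1, x1, ..., xk]
w = np.array([0.2, 1.0, 0.5])
print(h(x, w))                             # 0.2 + 0.4 - 0.5 = 0.1 > 0  ->  1.0
```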
4 Perceptron Learning
T • Learning = finding the “right” parameters w = [w0, w1, … , wk ] – Find w that minimizes an error function E(w) which measures the
misfit between h(xn,w) and tn.
– Expect that h(x,w) performing well on training examples xn Þ h(x,w) will perform well on arbitrary test examples x Î X.
• Least Squares error function?
E(w) = (1/2) Σn=1..N {h(xn, w) − tn}²   ← # of mistakes
Least Squares vs. Perceptron Criterion

• Least Squares ⇒ cost is # of misclassified patterns:
  – Piecewise constant function of w with discontinuities.
  – Cannot find a closed-form solution for w that minimizes the cost.
  – Cannot use gradient methods (gradient is zero almost everywhere).
• Perceptron Criterion:
  – Set the labels to be +1 and −1. Want wTxn > 0 for tn = +1, and wTxn < 0 for tn = −1.
    ⇒ would like to have wTxn tn > 0 for all patterns.
    ⇒ want to minimize −wTxn tn for all misclassified patterns M.
    ⇒ minimize EP(w) = − Σn∈M wTxn tn
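A sketch of EP(w) in NumPy, assuming the patterns are the rows of a matrix X (with a leading bias column) and t holds the ±1 labels:

```python
import numpy as np

def perceptron_criterion(w, X, t):
    """E_P(w) = - sum_{n in M} w^T x_n t_n, where M are the misclassified patterns."""
    scores = (X @ w) * t            # w^T x_n t_n for every pattern n
    M = scores <= 0                 # misclassified: w^T x_n t_n is not positive
    return -np.sum(scores[M])
```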
Stochastic Gradient Descent

• Perceptron Criterion: minimize EP(w) = − Σn∈M wTxn tn
• Update parameters w sequentially, after each mistake:
  w(τ+1) = w(τ) − η ∇EP(w(τ), xn) = w(τ) + η xn tn
• The magnitude of w is inconsequential ⇒ set η = 1:
  w(τ+1) = w(τ) + xn tn
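For a single misclassified pattern the gradient of EP is −xn tn, so the η = 1 update reduces to one line (a trivial sketch, with w and x_n as NumPy arrays):

```python
def perceptron_update(w, x_n, t_n):
    """w^(tau+1) = w^(tau) + t_n * x_n, applied only when x_n is misclassified."""
    return w + t_n * x_n
```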
The Perceptron Algorithm: Two Classes

1. initialize parameters w = 0
2. for n = 1 … N
3.     hn = sgn(wTxn)
4.     if hn ≠ tn then
5.         w = w + tnxn
Repeat steps 2–5: a) until convergence, or b) for a number of epochs E.
Theorem [Rosenblatt, 1962]: If the training dataset is linearly separable, the perceptron learning algorithm is guaranteed to find a solution in a finite number of steps. • see Theorem 1 (Block, Novikoff) in [Freund & Schapire, 1999].
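A runnable sketch of the pseudocode above (the tie-breaking of sgn at 0 toward −1 and the early-stopping test are my choices, not specified on the slide):

```python
import numpy as np

def train_perceptron(X, t, epochs=100):
    """X: (N, k+1) array whose rows are [1, x1, ..., xk]; t: labels in {+1, -1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):                         # repeat for a number of epochs E ...
        mistakes = 0
        for x_n, t_n in zip(X, t):
            h_n = 1 if np.dot(w, x_n) > 0 else -1   # h_n = sgn(w^T x_n)
            if h_n != t_n:
                w = w + t_n * x_n
                mistakes += 1
        if mistakes == 0:                           # ... or until convergence
            break
    return w
```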
Averaged Perceptron: Two Classes

1. initialize parameters w = 0, τ = 1, wa = 0
2. for n = 1 … N
3.     hn = sgn(wTxn)
4.     if hn ≠ tn then
5.         w = w + tnxn
6.     wa = wa + w
7.     τ = τ + 1
8. return wa / τ
Repeat steps 2–7: a) until convergence, or b) for a number of epochs E.

During testing: h(x) = sgn(waTx)
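A sketch of the averaged version; the running sum of weight vectors and the counter (w_sum, tau) mirror the pseudocode above:

```python
import numpy as np

def train_averaged_perceptron(X, t, epochs=100):
    """Returns the average of all intermediate weight vectors; its sign is used at test time."""
    w = np.zeros(X.shape[1])
    w_sum = np.zeros(X.shape[1])
    tau = 1
    for _ in range(epochs):
        for x_n, t_n in zip(X, t):
            if (1 if np.dot(w, x_n) > 0 else -1) != t_n:   # sgn(w^T x_n) != t_n
                w = w + t_n * x_n
            w_sum = w_sum + w
            tau += 1
    return w_sum / tau

def predict(w_avg, x):
    return 1 if np.dot(w_avg, x) > 0 else -1               # h(x) = sgn(w_avg^T x)
```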
Linear vs. Non-linear Decision Boundaries

[Plots of the And, Or, and Xor datasets: And and Or are linearly separable; can a linear boundary separate Xor?]

φ(x) = [1, x1, x2]T,  w = [w0, w1, w2]T  ⇒  wTφ(x) = [w1, w2][x1, x2]T + w0
How to Find Non-linear Decision Boundaries

1) Logistic Regression with manually engineered features:
   – Quadratic features.
2) Kernel methods (e.g. SVMs) with non-linear kernels:
   – Quadratic kernels, Gaussian kernels.
[Deep Learning class]
3) Unsupervised feature learning (e.g. auto-encoders):
   – Plug learned features in any linear classifier.
4) Neural Networks with one or more hidden layers:
   – Automatically learned features.
Non-Linear Classification: XOR Dataset
x = [x1, x2]
1) Manually Engineered Features: Add x1x2
x = [x1, x2, x1x2]
Logistic Regression with Manually Engineered Features
x = [x1, x2, x1x2]
Perceptron with Manually Engineered Features
Project x = [x1, x2, x1x2] and the decision hyperplane back to x = [x1, x2]
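A concrete sketch of this recipe on XOR (the ±1 label encoding and the plain perceptron updates are assumptions carried over from the earlier slides; the slides themselves only show the plots):

```python
import numpy as np

# XOR dataset with labels in {+1, -1}.
X_raw = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([-1, 1, 1, -1])

# Manually engineered feature: append x1 * x2, then prepend the bias feature x0 = 1.
X = np.column_stack([np.ones(len(X_raw)), X_raw, X_raw[:, 0] * X_raw[:, 1]])

# Plain perceptron updates; the augmented data are linearly separable, so this converges.
w = np.zeros(X.shape[1])
for _ in range(50):
    for x_n, t_n in zip(X, t):
        if (1 if np.dot(w, x_n) > 0 else -1) != t_n:
            w = w + t_n * x_n

print([1 if np.dot(w, x) > 0 else -1 for x in X])   # matches t
```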
2) Kernel Methods with Non-Linear Kernels

• Perceptrons, SVMs can be ‘kernelized’:
  1. Re-write the algorithm such that during training and testing feature vectors x, y appear only in dot-products xTy.
  2. Replace dot-products xTy with non-linear kernels K(x, y):
     • K is a kernel if and only if ∃φ such that K(x, y) = φ(x)Tφ(y)
       – φ can be in a much higher dimensional space.
         » e.g. combinations of up to k original features
       – φ(x)Tφ(y) can be computed efficiently without enumerating φ(x) or φ(y).
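For instance, the quadratic kernel K(x, y) = (xTy + 1)² corresponds to a feature map φ containing all monomials of degree ≤ 2, yet it is evaluated directly in the original low-dimensional space (a small illustrative sketch):

```python
import numpy as np

def quadratic_kernel(x, y):
    """K(x, y) = (x^T y + 1)^2 = phi(x)^T phi(y) for the degree-<=2 monomial feature map,
    computed without ever enumerating phi(x) or phi(y)."""
    return (np.dot(x, y) + 1.0) ** 2

print(quadratic_kernel(np.array([1.0, 2.0]), np.array([0.5, -1.0])))   # (0.5 - 2 + 1)^2 = 0.25
```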
The Perceptron Algorithm: Two Classes

1. initialize parameters w = 0
2. for n = 1 … N
3.     hn = sgn(wTxn)
4.     if hn ≠ tn then
5.         w = w + tnxn
Repeat steps 2–5: a) until convergence, or b) for a number of epochs E.
Loop invariant: w is a weighted sum of training vectors:
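Anticipating where this invariant leads (a hedged sketch; the αn notation is the standard dual form, not taken verbatim from the slide): every update adds tnxn, so w can always be written as w = Σn αn tn xn, where αn counts the mistakes made on xn.

```python
import numpy as np

def train_perceptron_dual(X, t, epochs=100):
    """Keeps alpha_n = number of mistakes on x_n; by the loop invariant, w = sum_n alpha_n t_n x_n."""
    alpha = np.zeros(len(X))
    for _ in range(epochs):
        for n, (x_n, t_n) in enumerate(zip(X, t)):
            w = (alpha * t) @ X                              # reconstruct w from the invariant
            if (1 if np.dot(w, x_n) > 0 else -1) != t_n:
                alpha[n] += 1                                # same effect as w = w + t_n * x_n
    return (alpha * t) @ X
```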