CS 4900/5900:

The Perceptron Algorithm and the Kernel Trick

Razvan C. Bunescu School of Electrical Engineering and Computer Science [email protected]

1 McCulloch-Pitts Neuron Function

[Figure: McCulloch-Pitts neuron. Inputs x0 = 1, x1, x2, x3 are weighted by w0, w1, w2, w3, summed (Σ wi xi), and passed through an activation/output function f to produce h(x).]

• Algebraic interpretation:
– The output of the neuron is a linear combination of inputs from other neurons, rescaled by the synaptic weights.
• The weights wi correspond to the synaptic weights (activating or inhibiting).
• The summation corresponds to the combination of signals in the soma.
– The result is often transformed through a monotonic activation function.

2 Activation/Output Functions

1 $" 0 if z < 0 unit step f (z) = # %$ 1 if z ≥ 0 Perceptron

logistic 1 f (z) = −z 1+ e identity f (z) = z 0
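As a quick reference, a minimal NumPy sketch of the three activation functions above; the function names are my own.

```python
import numpy as np

def unit_step(z):
    # f(z) = 0 if z < 0, 1 if z >= 0  (the Perceptron's activation)
    return np.where(z < 0, 0.0, 1.0)

def logistic(z):
    # f(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def identity(z):
    # f(z) = z
    return z

print(unit_step(np.array([-2.0, 0.0, 3.0])))  # [0. 1. 1.]
print(logistic(np.array([0.0])))              # [0.5]
```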

3 Perceptron

[Figure: Perceptron. Inputs x0 = 1, x1, x2, x3 with weights w0, ..., w3 are summed (Σ wi xi) and passed through the step activation function f.]

h_w(x) = 1 if w^T x > 0, 0 otherwise,   where f(z) = 0 if z < 0, 1 if z ≥ 0

• Assume classes T = {c1, c2} = {1, -1}.

• Training set is (x1, t1), (x2, t2), …, (xn, tn).
• x = [1, x1, x2, ..., xk]^T,   h(x) = step(w^T x)

4 Perceptron Learning

• Learning = finding the “right” parameters w = [w0, w1, …, wk]^T
– Find w that minimizes an error function E(w) which measures the misfit between h(xn, w) and tn.
– Expect that h(x, w) performing well on training examples xn ⇒ h(x, w) will perform well on arbitrary test examples x ∈ X.

• Least Squares error function?

E(w) = (1/2) Σ_{n=1}^{N} {h(xn, w) − tn}²    ← # of mistakes

5 Least Squares vs. Perceptron Criterion

• Least Squares => cost is # of misclassified patterns: – Piecewise constant function of w with discontinuities. – Cannot find closed form solution for w that minimizes cost. – Cannot use gradient methods (gradient zero almost everywhere).

• Perceptron Criterion:
– Set labels to be +1 and −1. Want w^T xn > 0 for tn = 1, and w^T xn < 0 for tn = −1.
⇒ would like to have w^T xn tn > 0 for all patterns.
⇒ want to minimize −w^T xn tn for all misclassified patterns M.
⇒ minimize E_P(w) = − Σ_{n∈M} w^T xn tn
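A minimal sketch of the criterion above, assuming the inputs already carry the bias feature x0 = 1 and the labels are in {+1, −1}; the function name is my own.

```python
import numpy as np

def perceptron_criterion(w, X, t):
    # E_P(w) = - sum over misclassified patterns M of w^T x_n t_n
    # X: (N, k+1) array with a leading 1 per row (bias); t: (N,) labels in {+1, -1}
    margins = (X @ w) * t          # w^T x_n t_n for every pattern
    mistakes = margins <= 0        # the misclassified set M
    return -np.sum(margins[mistakes])
```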

6 Stochastic Gradient Descent

• Perceptron Criterion: minimize E_P(w) = − Σ_{n∈M} w^T xn tn

• Update parameters w sequentially after each mistake:
w^(τ+1) = w^(τ) − η ∇E_P(w^(τ), xn) = w^(τ) + η xn tn

• The magnitude of w is inconsequential ⇒ set η = 1.
w^(τ+1) = w^(τ) + xn tn

7 The Perceptron Algorithm: Two Classes

1. initialize parameters w = 0
2. for n = 1 … N
3.    hn = sgn(w^T xn)
4.    if hn ≠ tn then
5.       w = w + tn xn
Repeat steps 2–5: (a) until convergence, or (b) for a number of epochs E.

Theorem [Rosenblatt, 1962]: If the training dataset is linearly separable, the perceptron learning algorithm is guaranteed to find a solution in a finite number of steps. • see Theorem 1 (Block, Novikoff) in [Freund & Schapire, 1999].
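A minimal NumPy sketch of steps 1–5 above, run for a fixed number of epochs E; it assumes the inputs already include the constant bias feature x0 = 1 and that labels are in {+1, −1}.

```python
import numpy as np

def perceptron_train(X, t, epochs=10):
    # X: (N, k+1) inputs with a leading bias feature; t: (N,) labels in {+1, -1}
    w = np.zeros(X.shape[1])                 # 1. initialize w = 0
    for _ in range(epochs):                  # repeat for a number of epochs E
        for n in range(X.shape[0]):          # 2. for n = 1 ... N
            h_n = np.sign(w @ X[n])          # 3. h_n = sgn(w^T x_n)
            if h_n != t[n]:                  # 4. if h_n != t_n then
                w = w + t[n] * X[n]          # 5. w = w + t_n x_n
    return w
```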

8 Averaged Perceptron: Two Classes

1. initialize parameters w = 0, τ = 1, w̄ = 0
2. for n = 1 … N
3.    hn = sgn(w^T xn)
4.    if hn ≠ tn then
5.       w = w + tn xn
6.    w̄ = w̄ + w
7.    τ = τ + 1
8. return w̄ / τ
Repeat steps 2–7: (a) until convergence, or (b) for a number of epochs E.

During testing: h(x) = sgn(w̄^T x)
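The averaged variant as a sketch, under the same assumptions as the previous snippet (bias feature included, labels in {+1, −1}); w̄ from the pseudocode is kept as a running sum.

```python
import numpy as np

def averaged_perceptron_train(X, t, epochs=10):
    w = np.zeros(X.shape[1])                 # 1. w = 0
    w_sum = np.zeros(X.shape[1])             # running sum of w (the averaged weights)
    tau = 1                                  # update counter
    for _ in range(epochs):                  # repeat for a number of epochs E
        for n in range(X.shape[0]):          # 2. for n = 1 ... N
            if np.sign(w @ X[n]) != t[n]:    # 3-4. mistake?
                w = w + t[n] * X[n]          # 5. w = w + t_n x_n
            w_sum = w_sum + w                # 6. accumulate w
            tau += 1                         # 7. tau = tau + 1
    return w_sum / tau                       # 8. return the averaged weights

# During testing: h(x) = np.sign(w_avg @ x)
```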

9 Linear vs. Non-linear Decision Boundaries

[Figure: the Boolean functions And, Or, and Xor in the (x1, x2) plane — which of them can a linear decision boundary separate?]

φ(x) = [1, x1, x2]^T,   w = [w0, w1, w2]^T
⇒ w^T φ(x) = [w1, w2][x1, x2]^T + w0

10 How to Find Non-linear Decision Boundaries

1) Logistic Regression with manually engineered features: – Quadratic features.

2) Kernel methods (e.g. SVMs) with non-linear kernels: – Quadratic kernels, Gaussian kernels.
3) Unsupervised feature learning (e.g. auto-encoders): – Plug the learned features into any classifier.

4) Neural Networks with one or more hidden layers: – Automatically learned features.

11 Non-Linear Classification: XOR Dataset

x = [x1, x2]

12 1) Manually Engineered Features: Add x1x2

x = [x1, x2, x1x2]

13 Logistic Regression with Manually Engineered Features

x = [x1, x2, x1x2]

14 Perceptron with Manually Engineered Features

Project x = [x1, x2, x1x2] and decision hyperplane back to x = [x1, x2]
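A small sketch, reusing the perceptron_train helper shown earlier, of how the extra product feature x1·x2 makes the XOR data linearly separable; the ±1 input encoding is my choice.

```python
import numpy as np

# XOR with inputs encoded as +/-1; target is +1 when x1 != x2, else -1
X_raw = np.array([[-1., -1.], [-1., 1.], [1., -1.], [1., 1.]])
t = np.array([-1., 1., 1., -1.])

# Engineered representation: [1 (bias), x1, x2, x1*x2]
X = np.column_stack([np.ones(4), X_raw[:, 0], X_raw[:, 1], X_raw[:, 0] * X_raw[:, 1]])

w = perceptron_train(X, t, epochs=10)   # helper sketched on an earlier slide
print(np.sign(X @ w))                   # matches t: XOR is separable in this space
```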

15 2) Kernel Methods with Non-Linear Kernels

• Linear classifiers such as the Perceptron and SVMs can be ‘kernelized’:
1. Re-write the algorithm such that, during training and testing, feature vectors x, y appear only in dot-products x^T y.
2. Replace dot-products x^T y with non-linear kernels K(x, y):
   • K is a kernel if and only if ∃φ such that K(x, y) = φ(x)^T φ(y)
     – φ can map into a much higher dimensional space,
       » e.g. combinations of up to k original features.
     – φ(x)^T φ(y) can be computed efficiently, without enumerating φ(x) or φ(y).

16 The Perceptron Algorithm: Two Classes

1. initialize parameters w = 0
2. for n = 1 … N
3.    hn = sgn(w^T xn)
4.    if hn ≠ tn then
5.       w = w + tn xn
Repeat steps 2–5: (a) until convergence, or (b) for a number of epochs E.

Loop invariant: w is a weighted sum of training vectors:

w = Σ_n αn tn xn   ⇒   w^T x = Σ_n αn tn xn^T x

17 Kernel Perceptron: Two Classes

1. define f(x) = w^T x = Σ_n αn tn xn^T x = Σ_n αn tn K(xn, x)
2. initialize dual parameters αn = 0
3. for n = 1 … N
4.    hn = sgn f(xn)
5.    if hn ≠ tn then
6.       αn = αn + 1

During testing: h(x) = sgn f(x)
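A minimal sketch of the dual (kernel) perceptron above, in the αn = αn + 1 form; the kernel is passed in as a plain Python function and the helper names are my own.

```python
import numpy as np

def kernel_perceptron_train(X, t, kernel, epochs=10):
    # Dual perceptron: f(x) = sum_n alpha_n t_n K(x_n, x)
    N = X.shape[0]
    alpha = np.zeros(N)                               # 2. dual parameters alpha_n = 0
    K = np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])  # Gram matrix
    for _ in range(epochs):                           # repeat for a number of epochs E
        for n in range(N):                            # 3. for n = 1 ... N
            f_n = np.sum(alpha * t * K[:, n])         # 4. f(x_n)
            if np.sign(f_n) != t[n]:                  # 5. if h_n != t_n then
                alpha[n] += 1                         # 6. alpha_n = alpha_n + 1
    return alpha

def kernel_perceptron_predict(x, X, t, alpha, kernel):
    # During testing: h(x) = sgn f(x)
    return np.sign(sum(alpha[n] * t[n] * kernel(X[n], x) for n in range(len(t))))
```

Passing, for example, kernel=lambda x, y: (x @ y) ** 2 (the quadratic kernel of the later slides) yields a non-linear classifier without ever enumerating φ(x).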

18 Kernel Perceptron: Two Classes

1. define f(x) = w^T x = Σ_n αn xn^T x = Σ_n αn K(xn, x)
2. initialize dual parameters αn = 0
3. for n = 1 … N
4.    hn = sgn f(xn)
5.    if hn ≠ tn then
6.       αn = αn + tn

During testing: h(x) = sgn f(x)

19 The Perceptron vs. Boolean Functions

[Figure: the Boolean functions And, Or, and Xor in the (x1, x2) plane — which of them can a linear decision boundary separate?]

φ(x) = [1, x1, x2]^T,   w = [w0, w1, w2]^T
⇒ w^T φ(x) = [w1, w2][x1, x2]^T + w0

20 Perceptron with Quadratic Kernel

• Discriminant function:

f(x) = Σ_i αi ti φ(xi)^T φ(x) = Σ_i αi ti K(xi, x)

• Quadratic kernel:
K(x, y) = (x^T y)^2 = (x1 y1 + x2 y2)^2

⇒ corresponding feature space φ(x) = ?

conjunctions of two atomic features
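A quick numerical check (my own sketch) that for two features the quadratic kernel equals a dot product of explicit feature maps φ(x) = [x1², √2·x1·x2, x2²], i.e. products of pairs of original features.

```python
import numpy as np

def phi_quadratic(x):
    # Explicit feature map for K(x, y) = (x^T y)^2 with x = [x1, x2]
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print((x @ y) ** 2)                          # kernel value: (3 - 2)^2 = 1
print(phi_quadratic(x) @ phi_quadratic(y))   # same value via explicit features: 1.0
```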

21 Perceptron with Quadratic Kernel

[Figure: points a, b, c, d in the input space x (linear kernel K(x, y) = x^T y) and their images in the feature space φ(x) (quadratic kernel K(x, y) = (x^T y)^2).]

22 Quadratic Kernels

• Circles, hyperbolas, and ellipses as separating surfaces:
K(x, y) = (1 + x^T y)^2 = φ(x)^T φ(y)
φ(x) = [1, √2·x1, √2·x2, x1², √2·x1·x2, x2²]^T

[Figure: a non-linear separating surface in the (x1, x2) plane.]

23 Quadratic Kernels

K(x, y) = (x^T y)^2 = φ(x)^T φ(y)

[Figure: mapping from the input space x to the feature space φ(x).]

24 Explicit Features vs. Kernels

• Explicitly enumerating features can be prohibitive:
– 1,000 basic features for x^T y ⇒ 500,500 quadratic features for (x^T y)^2.
– Much worse for higher order features.

• Solution:
– Do not compute the feature vectors, compute kernels instead (i.e. compute dot products between implicit feature vectors).
• (x^T y)^2 takes 1,001 multiplications.
• φ(x)^T φ(y) in the explicit feature space takes 500,500 multiplications.

25 Kernel Functions

• Definition: A function k : X × X → R is a kernel function if there exists a feature mapping φ : X → R^n such that:
k(x, y) = φ(x)^T φ(y)

• Theorem: k : X × X → R is a valid kernel ⇔ the Gram matrix K whose elements are given by k(xn, xm) is positive semidefinite for all possible choices of the set {xn}.
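A small sketch of how the theorem is used in practice: build the Gram matrix of a candidate kernel on an arbitrary sample of points and check that its eigenvalues are (numerically) non-negative. The Gaussian kernel and the random data here are my own example.

```python
import numpy as np

def gram_matrix(kernel, points):
    # K[n, m] = k(x_n, x_m)
    return np.array([[kernel(a, b) for b in points] for a in points])

rng = np.random.default_rng(0)
pts = rng.normal(size=(20, 3))                      # arbitrary sample {x_n}

gaussian = lambda x, y: np.exp(-np.sum((x - y) ** 2) / 2.0)
K = gram_matrix(gaussian, pts)
print(np.min(np.linalg.eigvalsh(K)) >= -1e-10)      # True: the Gram matrix is PSD
```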

26 Kernel Examples

• Linear kernel: K(x, y) = x^T y

• Quadratic kernel: K(x, y) = (c + x^T y)^2
– contains constant, linear terms and terms of order two (c > 0).

• Polynomial kernel: K(x, y) = (c + x^T y)^M
– contains all terms up to degree M (c > 0).

• Gaussian kernel: K(x, y) = exp(−‖x − y‖² / 2σ²)
– corresponding feature space has infinite dimensionality.
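The four kernels above, written out directly as a sketch (parameter names c, M, and sigma follow the slide).

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y                                              # x^T y

def quadratic_kernel(x, y, c=1.0):
    return (c + x @ y) ** 2                                   # (c + x^T y)^2

def polynomial_kernel(x, y, c=1.0, M=3):
    return (c + x @ y) ** M                                   # (c + x^T y)^M

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))   # exp(-||x - y||^2 / 2 sigma^2)
```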

27 Techniques for Constructing Kernels

28 Kernels over Discrete Structures

• Subsequence Kernels [Lodhi et al., JMLR 2002]:
– Σ is a finite alphabet (set of symbols).
– x, y ∈ Σ* are two sequences of symbols with lengths |x| and |y|.
– k(x, y) is defined as the number of common substrings of length n.
– k(x, y) can be computed in O(n|x||y|) time complexity (a simplified sketch follows after this list).

• Tree Kernels [Collins and Duffy, NIPS 2001]:

– T1 and T2 are two trees with N1 and N2 nodes respectively.

– k(T1, T2) is defined as the number of common subtrees.

– k(T1, T2) can be computed in O(N1N2) time complexity. – in practice, time is linear in the size of the trees.
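A simplified sketch in the spirit of the first bullet above: it counts, with multiplicity, the matching pairs of length-n substrings of x and y (an n-gram count kernel), which the naive double loop below computes in O(n|x||y|) time. This is only an illustration, not the gap-weighted subsequence kernel of Lodhi et al. or the tree kernel of Collins and Duffy.

```python
def substring_count_kernel(x, y, n=3):
    # Counts matching pairs of length-n substrings of x and y (with multiplicity),
    # i.e. the dot product of their n-gram count vectors, so it is a valid kernel.
    count = 0
    for i in range(len(x) - n + 1):
        for j in range(len(y) - n + 1):
            if x[i:i + n] == y[j:j + n]:     # O(n) comparison
                count += 1
    return count

print(substring_count_kernel("science", "conscience", n=4))  # 4 shared 4-grams
```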

29 Supplementary Reading

• PRML Chapter 6: – Section 6.1 on dual representations for linear regression models. – Section 6.2 on techniques for constructing new kernels.

31 The Perceptron Algorithm: K classes

1. initialize parameters w = 0
2. for i = 1 … n
3.    yi = argmax_{t∈T} w^T φ(xi, t)
4.    if yi ≠ ti then
5.       w = w + φ(xi, ti) − φ(xi, yi)
Repeat steps 2–5: (a) until convergence, or (b) for a number of epochs E.

During testing: t* = argmax_{t∈T} w^T φ(x, t)
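A minimal sketch of the K-class rule above. The slides do not fix a particular joint feature map φ(x, t); the per-class block construction below is one common choice of my own, and labels are assumed to be integers 0 … K−1.

```python
import numpy as np

def phi(x, c, num_classes):
    # Joint feature map: place x in the block that belongs to class c
    out = np.zeros(num_classes * x.shape[0])
    out[c * x.shape[0]:(c + 1) * x.shape[0]] = x
    return out

def multiclass_perceptron_train(X, t, num_classes, epochs=10):
    w = np.zeros(num_classes * X.shape[1])                    # 1. initialize w = 0
    for _ in range(epochs):                                   # repeat for E epochs
        for i in range(X.shape[0]):                           # 2. for i = 1 ... n
            scores = [w @ phi(X[i], c, num_classes) for c in range(num_classes)]
            y_i = int(np.argmax(scores))                      # 3. y_i = argmax_t w^T phi(x_i, t)
            if y_i != t[i]:                                   # 4. if y_i != t_i then
                w += phi(X[i], t[i], num_classes) - phi(X[i], y_i, num_classes)  # 5.
    return w

# During testing: t* = argmax over classes c of w @ phi(x, c, num_classes)
```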

32 Averaged Perceptron: K classes

1. initialize parameters w = 0, τ = 1, w̄ = 0
2. for i = 1 … n
3.    yi = argmax_{t∈T} w^T φ(xi, t)
4.    if yi ≠ ti then
5.       w = w + φ(xi, ti) − φ(xi, yi)
6.    w̄ = w̄ + w
7.    τ = τ + 1
8. return w̄ / τ
Repeat steps 2–7: (a) until convergence, or (b) for a number of epochs E.

During testing: t* = argmax_{t∈T} w̄^T φ(x, t)

33 The Perceptron Algorithm: K classes

1. initialize parameters w = 0
2. for i = 1 … n
3.    cj = argmax_{t∈T} w^T φ(xi, t)
4.    if cj ≠ ti then
5.       w = w + φ(xi, ti) − φ(xi, cj)
Repeat steps 2–5: (a) until convergence, or (b) for a number of epochs E.

Loop invariant: w is a weighted sum of training vectors:

w = Σ_{i,j} αij (φ(xi, ti) − φ(xi, cj))

⇒ w^T φ(x, t) = Σ_{i,j} αij (φ(xi, ti)^T φ(x, t) − φ(xi, cj)^T φ(x, t))

34 Kernel Perceptron: K classes

1. define f(x, t) = Σ_{i,j} αij (φ(xi, ti)^T φ(x, t) − φ(xi, cj)^T φ(x, t))
2. initialize dual parameters αij = 0
3. for i = 1 … n
4.    cj = argmax_{t∈T} f(xi, t)
5.    if cj ≠ ti then
6.       αij = αij + 1
Repeat steps 3–6: (a) until convergence, or (b) for a number of epochs E.

During testing: t* = arg max f (x,t) tÎT

35 Kernel Perceptron: K classes

• Discriminant function:
f(x, t) = Σ_{i,j} αij (φ(xi, ti)^T φ(x, t) − φ(xi, cj)^T φ(x, t))
        = Σ_{i,j} αij (K(xi, ti, x, t) − K(xi, cj, x, t))
where:
K(xi, ti, x, t) = φ(xi, ti)^T φ(x, t)
K(xi, cj, x, t) = φ(xi, cj)^T φ(x, t)
