Kernel Methods
Subhransu Maji
CMPSCI 689: 24 March 2015 and 26 March 2015

Administrivia

Kaggle: https://www.kaggle.com/competitions

Mini-project 2 due April 7, in class!
‣ implement multi-class reductions, naive Bayes, kernelized perceptron, multi-class and two-layer neural networks
‣ training set: [figure not shown]

Project proposals due April 2, in class!
‣ one page describing the project topic, goals, etc.
‣ list your team members (2+)
‣ project presentations: April 23 and 27
‣ final report: May 3


Feature mapping

Learn non-linear classifiers by mapping features.

Can we learn the XOR function with this mapping?

Quadratic feature map

Let x = [x1, x2, ..., xD]. Then the quadratic feature map is defined as:

φ(x) = [1, √2 x1, √2 x2, ..., √2 xD,
        x1², x1x2, x1x3, ..., x1xD,
        x2x1, x2², x2x3, ..., x2xD,
        ...,
        xDx1, xDx2, xDx3, ..., xD²]

It contains all single and pairwise terms. There are repetitions, e.g., x1x2 and x2x1, but hopefully the learning algorithm can handle redundant features.

Drawbacks of feature mapping

Computational
‣ Suppose training time is linear in the feature dimension; the quadratic feature map squares the training time.
Memory
‣ The quadratic feature map squares the memory required to store the training data.
Statistical
‣ The quadratic feature map squares the number of parameters.
‣ For now, let's assume that regularization will deal with overfitting.
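Below is a minimal NumPy sketch of the quadratic feature map defined above; the function name quadratic_feature_map is ours, and the √2 scaling and repeated cross terms follow the definition on the slide.

```python
import numpy as np

def quadratic_feature_map(x):
    """Map x in R^D to [1, sqrt(2)*x1, ..., sqrt(2)*xD, all pairwise products xi*xj]."""
    x = np.asarray(x, dtype=float)
    pairwise = np.outer(x, x).ravel()   # all xi*xj terms, repetitions (x1x2 and x2x1) included
    return np.concatenate(([1.0], np.sqrt(2.0) * x, pairwise))

x = np.array([1.0, 2.0, 3.0])
print(quadratic_feature_map(x).shape)   # (13,) = 1 + D + D^2 with D = 3
```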


Quadratic kernel

The dot product between feature maps of x and z is:

φ(x)ᵀφ(z) = 1 + 2x1z1 + 2x2z2 + ... + 2xDzD
            + x1²z1² + x1x2z1z2 + ... + x1xDz1zD + ...
            + xDx1zDz1 + xDx2zDz2 + ... + xD²zD²
          = 1 + 2 ∑ᵢ xᵢzᵢ + ∑ᵢ,ⱼ xᵢxⱼzᵢzⱼ
          = 1 + 2xᵀz + (xᵀz)²
          = (1 + xᵀz)²
          = K(x, z)    (the quadratic kernel)

Thus, we can compute φ(x)ᵀφ(z) in almost the same time as needed to compute xᵀz (one extra addition and multiplication).
We will rewrite various algorithms using only dot products (or kernel evaluations), and not explicit features.

Perceptron revisited

Input: training data (x1, y1), (x2, y2), ..., (xn, yn), feature map φ

Perceptron training algorithm (obtained by replacing x with φ(x); note the dependence on φ):
Initialize w ← [0, ..., 0]
for iter = 1, ..., T
‣ for i = 1, ..., n
  • predict according to the current model:
    ŷᵢ = +1 if wᵀφ(xᵢ) > 0, −1 otherwise
  • if yᵢ = ŷᵢ, no change
  • else, w ← w + yᵢφ(xᵢ)
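A quick numerical check (a sketch of ours, not course code) that the quadratic kernel identity derived above holds: the explicit feature-map dot product equals (1 + xᵀz)².

```python
import numpy as np

def phi(x):
    # explicit quadratic feature map: [1, sqrt(2)*x, all pairwise products]
    return np.concatenate(([1.0], np.sqrt(2.0) * x, np.outer(x, x).ravel()))

rng = np.random.default_rng(0)
x, z = rng.normal(size=5), rng.normal(size=5)

lhs = phi(x) @ phi(z)        # dot product in the lifted space, O(D^2) work
rhs = (1.0 + x @ z) ** 2     # quadratic kernel K(x, z), O(D) work
print(np.isclose(lhs, rhs))  # True
```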

Properties of the weight vector

Linear algebra recap:
‣ Let U be a set of vectors in Rᴰ, i.e., U = {u1, u2, ..., uD} with uᵢ ∈ Rᴰ
‣ Span(U) is the set of all vectors that can be represented as ∑ᵢ aᵢuᵢ, such that aᵢ ∈ R
‣ Null(U) is everything that is left, i.e., Rᴰ \ Span(U)

Perceptron representer theorem: during the run of the perceptron training algorithm, the weight vector w is always in the span of φ(x1), φ(x2), ..., φ(xn):

w = ∑ᵢ αᵢφ(xᵢ)    (mistake updates become αᵢ ← αᵢ + yᵢ)

wᵀφ(z) = (∑ᵢ αᵢφ(xᵢ))ᵀ φ(z) = ∑ᵢ αᵢ φ(xᵢ)ᵀφ(z)

Kernelized perceptron

Input: training data (x1, y1), (x2, y2), ..., (xn, yn), feature map φ

Kernelized perceptron training algorithm:
Initialize α ← [0, 0, ..., 0]
for iter = 1, ..., T
‣ for i = 1, ..., n
  • predict according to the current model:
    ŷᵢ = +1 if ∑ₙ αₙ φ(xₙ)ᵀφ(xᵢ) > 0, −1 otherwise
  • if yᵢ = ŷᵢ, no change
  • else, αᵢ ← αᵢ + yᵢ

Polynomial kernel of degree p: φ(x)ᵀφ(z) = (1 + xᵀz)ᵖ
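A minimal sketch of the kernelized perceptron above. The toy XOR data, the number of passes, and the helper names are our choices for illustration, not course defaults; any kernel function of two vectors can be plugged in.

```python
import numpy as np

def kernelized_perceptron(X, y, kernel, T=10):
    """X: (n, D) inputs, y: (n,) labels in {-1, +1}, kernel: function of two vectors."""
    n = X.shape[0]
    K = np.array([[kernel(X[a], X[b]) for b in range(n)] for a in range(n)])  # Gram matrix
    alpha = np.zeros(n)
    for _ in range(T):
        for i in range(n):
            y_hat = 1 if alpha @ K[:, i] > 0 else -1   # predict with the current model
            if y_hat != y[i]:
                alpha[i] += y[i]                       # update only on mistakes
    return alpha

# XOR-like labels: separable with the quadratic kernel but not with a linear classifier
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, 1, 1, -1])
quad = lambda a, b: (1.0 + a @ b) ** 2
alpha = kernelized_perceptron(X, y, quad, T=300)   # a few hundred passes suffice here
pred = [1 if sum(alpha[i] * quad(X[i], z) for i in range(len(X))) > 0 else -1 for z in X]
print(pred)   # [-1, 1, 1, -1] once the algorithm has converged
```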

Support vector machines

Kernels existed long before SVMs, but were popularized by them.
Does the representer theorem hold for SVMs?
Recall that the objective function of an SVM is:

min_w  (1/2)||w||² + C ∑ₙ max(0, 1 − yₙwᵀxₙ)

Let w = w∥ + w⊥, where w∥ ∈ Span({x1, x2, ..., xn}) and w⊥ is orthogonal to it.

Only w∥ affects classification:
wᵀxᵢ = (w∥ + w⊥)ᵀxᵢ = w∥ᵀxᵢ + w⊥ᵀxᵢ = w∥ᵀxᵢ

The norm decomposes:
wᵀw = (w∥ + w⊥)ᵀ(w∥ + w⊥) = w∥ᵀw∥ + w⊥ᵀw⊥ ≥ w∥ᵀw∥

Hence, w ∈ Span({x1, x2, ..., xn}). The representer theorem is easy here.

Kernel k-means

Initialize k centers by picking k points randomly.
Repeat till convergence (or max iterations):
‣ Assign each point to the nearest center (assignment step):
  arg min_S ∑_{i=1}^{k} ∑_{x∈Sᵢ} ||φ(x) − μᵢ||²
‣ Estimate the mean of each group (update step):
  μᵢ ← (1/|Sᵢ|) ∑_{x∈Sᵢ} φ(x)

Exercise: show how to compute ||φ(x) − μᵢ||² using dot products.
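One way to approach the exercise (a sketch of ours, not the official solution): expand ||φ(x) − μᵢ||² with μᵢ = (1/|Sᵢ|) ∑_{x'∈Sᵢ} φ(x'); only kernel evaluations remain. The helper name and the linear-kernel sanity check are ours.

```python
import numpy as np

def dist_to_center_sq(x, cluster, kernel):
    """||phi(x) - mu||^2 where mu is the mean of phi over `cluster`, using only kernel calls."""
    m = len(cluster)
    k_xx = kernel(x, x)
    k_xc = sum(kernel(x, c) for c in cluster) / m
    k_cc = sum(kernel(c1, c2) for c1 in cluster for c2 in cluster) / (m * m)
    return k_xx - 2.0 * k_xc + k_cc

# sanity check with the linear kernel, where phi is the identity map
rng = np.random.default_rng(0)
cluster = [rng.normal(size=3) for _ in range(5)]
x = rng.normal(size=3)
linear = lambda a, b: float(a @ b)
mu = np.mean(cluster, axis=0)
print(np.isclose(dist_to_center_sq(x, cluster, linear), np.sum((x - mu) ** 2)))  # True
```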

What makes a kernel?

A kernel is a mapping K: X × X → R.
Functions that can be written as dot products are valid kernels:

K(x, z) = φ(x)ᵀφ(z)

Example: the polynomial kernel K_poly(x, z) = (1 + xᵀz)ᵈ

Alternate characterization of a kernel:
A function K: X × X → R is a kernel if K is positive semi-definite (psd). This property is also called Mercer's condition.
This means that for all square-integrable functions f (except the zero function), the following property holds:

∫∫ f(x)K(x, z)f(z) dz dx ≥ 0,    where ∫ f(x)² dx < ∞

Why is this characterization useful?

We can prove some properties about kernels that are otherwise hard to prove.
Theorem: if K1 and K2 are kernels, then K1 + K2 is also a kernel.
Proof:

∫∫ f(x)K(x, z)f(z) dz dx = ∫∫ f(x)(K1(x, z) + K2(x, z))f(z) dz dx
                         = ∫∫ f(x)K1(x, z)f(z) dz dx + ∫∫ f(x)K2(x, z)f(z) dz dx
                         ≥ 0 + 0

More generally, if K1, K2, ..., Kn are kernels then ∑ᵢ αᵢKᵢ with αᵢ ≥ 0 is also a kernel.
We can build new kernels by linearly combining existing kernels.
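A small numerical illustration of the closure property (our sketch): the Gram matrices of two kernels and of their sum all have non-negative eigenvalues, i.e., they are psd up to numerical error.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))

K1 = (1.0 + X @ X.T) ** 2                                  # quadratic (polynomial) kernel
sq = np.sum(X ** 2, axis=1)
K2 = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T))    # RBF kernel with gamma = 1

for K in (K1, K2, K1 + K2):
    # a symmetric matrix is psd iff its smallest eigenvalue is >= 0 (up to round-off)
    print(np.linalg.eigvalsh(K).min() >= -1e-8)            # True for all three
```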


Why is this characterization useful?

We can show that the Gaussian function is a kernel, also called the radial basis function (RBF) kernel:

K_rbf(x, z) = exp(−||x − z||²)

Let's look at the classification function using an SVM with the RBF kernel:

f(z) = ∑ᵢ αᵢK_rbf(xᵢ, z) = ∑ᵢ αᵢ exp(−||xᵢ − z||²)

This is similar to a two-layer network with the RBF as the link function.
Gaussian kernels are examples of universal kernels: they can approximate any function in the limit as training data goes to infinity.

Kernels in practice

Feature mapping via kernels often improves performance. MNIST digits test error (60,000 training examples):
‣ 8.4% SVM linear
‣ 1.4% SVM RBF
‣ 1.1% SVM polynomial (d=4)

http://yann.lecun.com/exdb/mnist/
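A rough, hedged illustration of the linear vs. RBF vs. polynomial comparison, using scikit-learn's small 8x8 digits set rather than the full 60,000-example MNIST quoted above, so the error rates will differ from the slide's numbers.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X = X / 16.0                                    # scale pixel values to [0, 1]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for name, clf in [("linear", SVC(kernel="linear")),
                  ("rbf", SVC(kernel="rbf")),
                  ("poly d=4", SVC(kernel="poly", degree=4))]:
    clf.fit(X_tr, y_tr)
    print(name, "test error: %.3f" % (1.0 - clf.score(X_te, y_te)))
```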

Kernels over general structures

Kernels can be defined over any pair of inputs such as strings, trees and graphs.
Kernel over trees: K(tree1, tree2) = number of common subtrees (http://en.wikipedia.org/wiki/Tree_kernel)
‣ This can be computed efficiently using dynamic programming
‣ Can be used with SVMs, k-means, etc.
For strings, the number of common substrings is a kernel.
Graph kernels that measure graph similarity (e.g., number of common subgraphs) have been used to predict toxicity of chemical structures.

Kernels for computer vision

Histogram intersection kernel between two histograms a and b:

K(a, b) = ∑ⱼ min(aⱼ, bⱼ)

[figure: histograms a and b and their elementwise minimum min(a, b)]

Introduced by Swain and Ballard in 1991 to compare color histograms.
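A tiny sketch of the histogram intersection kernel; the example histograms are made up.

```python
import numpy as np

def intersection_kernel(a, b):
    """Histogram intersection: K(a, b) = sum_j min(a_j, b_j)."""
    return float(np.minimum(a, b).sum())

a = np.array([0.1, 0.4, 0.3, 0.2])   # two example normalized histograms
b = np.array([0.2, 0.2, 0.5, 0.1])
print(intersection_kernel(a, b))     # 0.1 + 0.2 + 0.3 + 0.1 = 0.7
```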


Kernel classifiers tradeoffs

Linear kernel: h(z) = wᵀz; evaluation time is O(feature dimension).
Non-linear kernel: h(z) = ∑_{i=1}^{N} αᵢK(xᵢ, z); evaluation time is O(N × feature dimension).

[figure: accuracy vs. evaluation time for linear and non-linear kernels]

Kernel classification function

For the histogram intersection kernel, the classification function is:

h(z) = ∑_{i=1}^{N} αᵢK(xᵢ, z) = ∑_{i=1}^{N} αᵢ ( ∑_{j=1}^{D} min(xᵢⱼ, zⱼ) )
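A sketch (ours) of the two decision functions side by side, just to make the evaluation costs concrete: the linear one touches only w, while the kernel one touches every support vector.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 64
X = rng.random((N, D))          # N support vectors
alpha = rng.normal(size=N)      # dual coefficients
w = rng.normal(size=D)          # weight vector of a linear classifier
z = rng.random(D)               # test point

h_linear = w @ z                                                           # O(D)
h_kernel = sum(alpha[i] * np.minimum(X[i], z).sum() for i in range(N))     # O(N * D)
print(h_linear, h_kernel)
```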

Kernel classification function

Key insight: the additive property.

h(z) = ∑_{i=1}^{N} αᵢ ( ∑_{j=1}^{D} min(xᵢⱼ, zⱼ) )
     = ∑_{j=1}^{D} ( ∑_{i=1}^{N} αᵢ min(xᵢⱼ, zⱼ) )
     = ∑_{j=1}^{D} hⱼ(zⱼ),    where hⱼ(zⱼ) = ∑_{i=1}^{N} αᵢ min(xᵢⱼ, zⱼ)

Algorithm 1 evaluates each hⱼ(zⱼ) directly, which costs O(N) per coordinate.
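A numerical check (ours) that exchanging the order of the two sums leaves h(z) unchanged, which is exactly the additive decomposition into per-coordinate functions hⱼ.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 8
X = rng.random((N, D))        # support vectors x_i as rows
alpha = rng.normal(size=N)    # dual coefficients alpha_i
z = rng.random(D)

# direct form: h(z) = sum_i alpha_i * sum_j min(x_ij, z_j)
h_direct = sum(alpha[i] * np.minimum(X[i], z).sum() for i in range(N))

# additive form: h(z) = sum_j h_j(z_j), with h_j(z_j) = sum_i alpha_i * min(x_ij, z_j)
h_additive = sum((alpha * np.minimum(X[:, j], z[j])).sum() for j in range(D))

print(np.isclose(h_direct, h_additive))   # True
```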


Kernel classification function [Maji et al., PAMI 2013]

Algorithm 1:

hⱼ(zⱼ) = ∑_{i=1}^{N} αᵢ min(xᵢⱼ, zⱼ)                           O(N)
       = ∑_{i: xᵢⱼ < zⱼ} αᵢxᵢⱼ + zⱼ ∑_{i: xᵢⱼ ≥ zⱼ} αᵢ          O(log N)

Sort the support vector values in each coordinate and pre-compute these sums for each rank. To evaluate, find the position of zⱼ in the sorted list of support vector values xᵢⱼ. This can be done in O(log N) time using binary search.

Algorithm 2:

For many problems hⱼ is smooth (blue plot). Hence, we can approximate it with fewer uniformly spaced segments (red plot). This saves time and space!
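A sketch of Algorithm 1 as we read it from the slide (function names are ours): sort each coordinate once, pre-compute the rank sums, then answer each hⱼ(zⱼ) with one binary search.

```python
import numpy as np

def precompute(X, alpha):
    """Per coordinate: sorted support values, prefix sums of alpha*x, suffix sums of alpha."""
    order = np.argsort(X, axis=0)                        # sort each column independently
    xs = np.take_along_axis(X, order, axis=0)
    a_sorted = alpha[order]                              # alpha permuted the same way per column
    prefix_ax = np.cumsum(a_sorted * xs, axis=0)         # sum of alpha_i * x_ij over the r smallest
    suffix_a = np.cumsum(a_sorted[::-1], axis=0)[::-1]   # sum of alpha_i over the remaining ranks
    return xs, prefix_ax, suffix_a

def h(z, xs, prefix_ax, suffix_a):
    """Evaluate h(z) = sum_j h_j(z_j) with one binary search per coordinate."""
    total = 0.0
    for j in range(len(z)):
        r = np.searchsorted(xs[:, j], z[j], side="right")    # number of x_ij <= z_j
        below = prefix_ax[r - 1, j] if r > 0 else 0.0        # sum of alpha_i * x_ij over x_ij <= z_j
        above = suffix_a[r, j] if r < xs.shape[0] else 0.0   # sum of alpha_i over x_ij > z_j
        total += below + z[j] * above
    return total

# check against the direct O(N * D) evaluation
rng = np.random.default_rng(0)
N, D = 100, 5
X, alpha, z = rng.random((N, D)), rng.normal(size=N), rng.random(D)
direct = sum(alpha[i] * np.minimum(X[i], z).sum() for i in range(N))
print(np.isclose(direct, h(z, *precompute(X, alpha))))   # True
```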

Kernel classification function [Maji et al., PAMI 2013]

The same idea applies to any additive kernel:

K(x, z) = ∑_{j=1}^{D} kⱼ(xⱼ, zⱼ)

‣ Intersection: k(a, b) = min(a, b)
‣ Chi-squared: k(a, b) = 2ab / (a + b)
‣ Jensen-Shannon: k(a, b) = a log((a + b)/a) + b log((a + b)/b)

Algorithm 2 applies to these additive kernels as well.

Linear and intersection kernel SVM

Using the histograms of oriented gradients feature:

Dataset                    Measure            Linear SVM   IK SVM   Speedup
INRIA pedestrians          Recall @ 2 FPPI    78.9         86.6     2594x
DC pedestrians             Accuracy           72.2         89.0     2253x
Caltech101, 15 examples    Accuracy           38.8         50.1     37x
Caltech101, 30 examples    Accuracy           44.3         56.6     62x
MNIST digits               Error              1.44         0.77     2500x
UIUC cars (Single Scale)   Precision @ EER    89.8         98.5     65x

On average more accurate than linear and 100-1000x faster than a standard kernel classifier. A similar idea can be applied to training as well.
Research question: when can we approximate kernels efficiently?
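A sketch (ours) of the three additive kernels listed above, each written as a sum of per-coordinate terms; the chi-squared and Jensen-Shannon forms assume strictly positive entries to avoid division by zero and log of zero.

```python
import numpy as np

def intersection(a, b):
    return float(np.minimum(a, b).sum())

def chi_squared(a, b):
    # sum_j 2*a_j*b_j / (a_j + b_j); assumes a, b > 0
    return float((2.0 * a * b / (a + b)).sum())

def jensen_shannon(a, b):
    # sum_j a_j*log((a_j + b_j)/a_j) + b_j*log((a_j + b_j)/b_j); assumes a, b > 0
    s = a + b
    return float((a * np.log(s / a) + b * np.log(s / b)).sum())

a = np.array([0.1, 0.4, 0.3, 0.2])
b = np.array([0.2, 0.2, 0.5, 0.1])
for k in (intersection, chi_squared, jensen_shannon):
    print(k.__name__, k(a, b))
```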

Slides credit

Some of the slides are based on the CIML book by Hal Daumé III.
Experiments on various datasets: "Efficient Classification for Additive Kernel SVMs", S. Maji, A. C. Berg and J. Malik, PAMI, Jan 2013.

Some resources:
‣ LIBSVM: kernel SVM classifier training and testing
  ➡ http://www.csie.ntu.edu.tw/~cjlin/libsvm/
‣ LIBLINEAR: fast linear classifier training
  ➡ http://www.csie.ntu.edu.tw/~cjlin/liblinear/
‣ LIBSPLINE: fast additive kernel training and testing
  ➡ https://github.com/msubhransu/libspline
