
CS260: Algorithms
Lecture 8: Kernel Methods

Cho-Jui Hsieh, UCLA

Feb 4, 2019

Outline

Linear Support Vector Machines
Nonlinear SVM, Kernel methods
Multiclass classification

Support Vector Machines

Given training examples $(x_1, y_1), \dots, (x_n, y_n)$.
Consider binary classification: $y_i \in \{+1, -1\}$.
Linear Support Vector Machine (SVM):
$$\arg\min_w \;\; C \sum_{i=1}^n \max(1 - y_i w^T x_i,\, 0) + \frac{1}{2} w^T w$$
(hinge loss with L2 regularization)
Prefer a hyperplane with maximum margin.
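For concreteness, here is a minimal NumPy sketch of this objective; the names X, y, w and the default value of C are illustrative assumptions, not from the slides.

```python
import numpy as np

def svm_objective(w, X, y, C=1.0):
    # Hinge loss with L2 regularization:
    # C * sum_i max(1 - y_i w^T x_i, 0) + 0.5 * w^T w
    margins = 1.0 - y * (X @ w)              # 1 - y_i w^T x_i for each sample
    hinge = np.maximum(margins, 0.0).sum()   # summed hinge loss
    return C * hinge + 0.5 * w @ w           # plus the L2 regularizer
```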

Support Vector Machines

Goal: Find a hyperplane to separate these two classes of data:
if $y_i = 1$, then $w^T x_i \ge 1$; if $y_i = -1$, then $w^T x_i \le -1$.
Prefer a hyperplane with maximum margin.

Size of margin

The half margin is the minimum of $\|x\|$ such that $w^T x = 1$.
Clearly, $x = \alpha \frac{w}{\|w\|}$ for some $\alpha$ (the half margin), so $\alpha = \frac{1}{\|w\|}$.
Maximize margin $\Rightarrow$ minimize $\|w\|$.
What if there are outliers?

Support Vector Machines (hard constraints)

SVM primal problem (with hard constraints):
$$\min_w \;\; \frac{1}{2} w^T w \quad \text{s.t.} \;\; y_i (w^T x_i) \ge 1, \quad i = 1, \dots, n$$
What if there are outliers?

Support Vector Machines

Given training data $x_1, \dots, x_n \in \mathbb{R}^d$ with labels $y_i \in \{+1, -1\}$.
SVM primal problem:
$$\min_{w, \xi} \;\; \frac{1}{2} w^T w + C \sum_{i=1}^n \xi_i \quad \text{s.t.} \;\; y_i (w^T x_i) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \quad i = 1, \dots, n$$

Equivalent to
$$\min_w \;\; \underbrace{\frac{1}{2} w^T w}_{\text{L2 regularization}} + C\, \underbrace{\sum_{i=1}^n \max(0,\, 1 - y_i w^T x_i)}_{\text{hinge loss}}$$

Non-differentiable when $y_i w^T x_i = 1$ for some $i$.

Stochastic Subgradient Method for SVM

A subgradient of $\ell_i(w) = \max(0,\, 1 - y_i w^T x_i)$:
$$g_i(w) = \begin{cases} -y_i x_i & \text{if } 1 - y_i w^T x_i > 0 \\ 0 & \text{if } 1 - y_i w^T x_i < 0 \\ 0 & \text{if } 1 - y_i w^T x_i = 0 \end{cases}$$

Stochastic subgradient descent for SVM:

For $t = 1, 2, \dots$:
    Randomly pick an index $i$.
    If $y_i w^T x_i < 1$: $\; w \leftarrow (1 - \eta_t) w + \eta_t\, nC\, y_i x_i$.
    Else (if $y_i w^T x_i \ge 1$): $\; w \leftarrow (1 - \eta_t) w$.
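A minimal NumPy sketch of this update loop; the step-size schedule $\eta_t = 1/t$, the epoch count, and the variable names are illustrative assumptions, not fixed by the slides.

```python
import numpy as np

def stochastic_subgradient_svm(X, y, C=1.0, epochs=10, seed=0):
    # Minimizes 0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i w^T x_i)
    # using the per-sample update from the slide.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / t                    # assumed step-size schedule
            if y[i] * (X[i] @ w) < 1:
                w = (1 - eta) * w + eta * n * C * y[i] * X[i]
            else:
                w = (1 - eta) * w
    return w
```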

Kernel SVM

Non-linearly separable problems

What if the data is not linearly separable?
Solution: map the data $x_i$ to a higher dimensional (maybe infinite dimensional) feature space $\varphi(x_i)$, where they are linearly separable.

SVM with nonlinear mapping

SVM with nonlinear mapping $\varphi(\cdot)$:
$$\min_{w, \xi} \;\; \frac{1}{2} w^T w + C \sum_{i=1}^n \xi_i \quad \text{s.t.} \;\; y_i (w^T \varphi(x_i)) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \quad i = 1, \dots, n$$
Hard to solve if $\varphi(\cdot)$ maps to a very high or infinite dimensional space.

Support Vector Machines (dual)

Primal problem:

$$\min_{w, \xi} \;\; \frac{1}{2} \|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \;\; y_i w^T \varphi(x_i) - 1 + \xi_i \ge 0 \;\text{ and }\; \xi_i \ge 0 \quad \forall i = 1, \dots, n$$

Equivalent to:
$$\min_{w, \xi} \max_{\alpha \ge 0, \beta \ge 0} \;\; \frac{1}{2} \|w\|^2 + C \sum_i \xi_i - \sum_i \alpha_i \big(y_i w^T \varphi(x_i) - 1 + \xi_i\big) - \sum_i \beta_i \xi_i$$
Under certain conditions (e.g., Slater's condition), exchanging min and max does not change the optimal solution:
$$\max_{\alpha \ge 0, \beta \ge 0} \min_{w, \xi} \;\; \frac{1}{2} \|w\|^2 + C \sum_i \xi_i - \sum_i \alpha_i \big(y_i w^T \varphi(x_i) - 1 + \xi_i\big) - \sum_i \beta_i \xi_i$$

Support Vector Machines (dual)

Reorganize the equation:

$$\max_{\alpha \ge 0, \beta \ge 0} \min_{w, \xi} \;\; \frac{1}{2} \|w\|^2 - \sum_i \alpha_i y_i w^T \varphi(x_i) + \sum_i \xi_i (C - \alpha_i - \beta_i) + \sum_i \alpha_i$$
Now, for any given $\alpha, \beta$, the minimizer over $w$ satisfies
$$\frac{\partial L}{\partial w} = w - \sum_i \alpha_i y_i \varphi(x_i) = 0 \;\;\Rightarrow\;\; w^* = \sum_i y_i \alpha_i \varphi(x_i)$$
Also, we must have $C = \alpha_i + \beta_i$; otherwise $\xi_i$ can drive the objective to $-\infty$.
Substituting these two equations back, we get
$$\max_{\alpha \ge 0, \beta \ge 0, C = \alpha + \beta} \;\; -\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \varphi(x_i)^T \varphi(x_j) + \sum_i \alpha_i$$

Support Vector Machines (dual)

Therefore, we get the following dual problem:
$$\max_{C \ge \alpha \ge 0} \;\; \Big\{ -\frac{1}{2} \alpha^T Q \alpha + e^T \alpha \Big\} := D(\alpha),$$
where $Q$ is an $n \times n$ matrix with $Q_{ij} = y_i y_j \varphi(x_i)^T \varphi(x_j)$.
Based on the derivation, we know:
1. Primal minimum = dual maximum (under Slater's condition).
2. Let $\alpha^*$ be the dual solution and $w^*$ the primal solution; then
$$w^* = \sum_i y_i \alpha_i^* \varphi(x_i).$$
We can solve the dual problem instead of the primal problem.
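As an illustration only (not how LIBSVM actually solves it), a box-constrained dual of this form can be handled with projected gradient ascent; a minimal sketch, assuming the matrix Q has already been formed:

```python
import numpy as np

def solve_dual_projected_gradient(Q, C, steps=1000):
    # Maximize D(a) = -0.5 * a^T Q a + e^T a  subject to  0 <= a <= C.
    n = Q.shape[0]
    alpha = np.zeros(n)
    lr = 1.0 / (np.linalg.norm(Q, 2) + 1e-12)       # step size from the spectral norm of Q
    for _ in range(steps):
        grad = 1.0 - Q @ alpha                      # gradient of the dual objective
        alpha = np.clip(alpha + lr * grad, 0.0, C)  # ascent step, then project onto the box
    return alpha
```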

Kernel Trick

Do not directly define $\varphi(\cdot)$. Instead, define a "kernel":
$$K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$$
This is all we need to know for Kernel SVM!
Examples:
Gaussian kernel: $K(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}$
Polynomial kernel: $K(x_i, x_j) = (\gamma\, x_i^T x_j + c)^d$
Other kernels for specific problems:
Graph kernels (Vishwanathan et al., "Graph Kernels", JMLR, 2010)
Pyramid kernel for image matching (Grauman and Darrell, "The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features", ICCV, 2005)
String kernel (Lodhi et al., "Text classification using string kernels", JMLR, 2002)
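A short NumPy sketch of the two example kernels, vectorized over the rows of X and Z (the parameter defaults are illustrative):

```python
import numpy as np

def gaussian_kernel(X, Z, gamma=1.0):
    # K[i, j] = exp(-gamma * ||x_i - z_j||^2)
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def polynomial_kernel(X, Z, gamma=1.0, c=1.0, degree=3):
    # K[i, j] = (gamma * x_i^T z_j + c)^d
    return (gamma * X @ Z.T + c) ** degree
```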

Support Vector Machines (dual)

Training: compute $\alpha = [\alpha_1, \dots, \alpha_n]$ by solving the quadratic optimization problem:
$$\min_{0 \le \alpha \le C} \;\; \frac{1}{2} \alpha^T Q \alpha - e^T \alpha,$$
where $Q_{ij} = y_i y_j K(x_i, x_j)$.
Prediction: for a test point $x$,
$$w^T \varphi(x) = \sum_{i=1}^n y_i \alpha_i \varphi(x_i)^T \varphi(x) = \sum_{i=1}^n y_i \alpha_i K(x_i, x)$$
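Given a dual solution $\alpha$ (from any solver) and a kernel function, prediction follows directly from the formula above; a minimal sketch with illustrative names:

```python
import numpy as np

def kernel_svm_decision(x, X_train, y_train, alpha, kernel):
    # w^T phi(x) = sum_i y_i * alpha_i * K(x_i, x)
    k = np.array([kernel(x_i, x) for x_i in X_train])
    return np.sum(y_train * alpha * k)

def kernel_svm_predict(x, X_train, y_train, alpha, kernel):
    # Binary prediction in {+1, -1} from the sign of the decision value.
    return np.sign(kernel_svm_decision(x, X_train, y_train, alpha, kernel))
```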

Kernel Ridge Regression

Actually, this "kernel method" works for many different losses.
Example: ridge regression:
$$\min_w \;\; \frac{1}{2} \|w\|^2 + \frac{1}{2} \sum_{i=1}^n (w^T \varphi(x_i) - y_i)^2$$
Dual problem:
$$\min_\alpha \;\; \alpha^T Q \alpha + \|\alpha\|^2 - 2 \alpha^T y$$
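Setting the gradient of this dual to zero gives $(Q + I)\alpha = y$. A minimal sketch, under the assumptions that $Q$ is the kernel matrix $Q_{ij} = K(x_i, x_j)$ (the slide does not restate $Q$ here) and that prediction uses the representer form $w = \sum_i \alpha_i \varphi(x_i)$, analogous to the SVM case:

```python
import numpy as np

def kernel_ridge_fit(K, y):
    # Minimize a^T K a + ||a||^2 - 2 a^T y  =>  (K + I) a = y
    n = K.shape[0]
    return np.linalg.solve(K + np.eye(n), y)

def kernel_ridge_predict(alpha, k_test):
    # k_test[i] = K(x_i, x) for a test point x; prediction = sum_i alpha_i * K(x_i, x)
    return alpha @ k_test
```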

Scalability

Challenge for solving kernel SVMs (for a dataset with $n$ samples):
Space: $O(n^2)$ for storing the $n$-by-$n$ kernel matrix (can be reduced in some cases)
Time: $O(n^3)$ for computing the exact solution
Good packages available:
LIBSVM (can be called in scikit-learn)
LIBLINEAR (for linear SVM, can be called in scikit-learn)
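Both solvers are exposed through scikit-learn; a short illustrative sketch (the tiny synthetic dataset and hyperparameters are placeholders):

```python
import numpy as np
from sklearn.svm import SVC, LinearSVC

# Tiny synthetic dataset just to make the sketch runnable.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

kernel_svm = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)   # SVC is backed by LIBSVM
linear_svm = LinearSVC(C=1.0).fit(X, y)                      # LinearSVC is backed by LIBLINEAR

print(kernel_svm.predict(X[:5]), linear_svm.predict(X[:5]))
```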

Multiclass classification

Multiclass Learning

$n$ data points, $L$ labels, $d$ features
Input: training data $\{x_i, y_i\}_{i=1}^n$:
Each $x_i$ is a $d$-dimensional feature vector
Each $y_i \in \{1, \dots, L\}$ is the corresponding label
Each training example belongs to one category
Goal: find a function to predict the correct label, $f(x) \approx y$

Multi-label Problems

$n$ data points, $L$ labels, $d$ features
Input: training data $\{x_i, y_i\}_{i=1}^n$:
Each $x_i$ is a $d$-dimensional feature vector
Each $y_i \in \{0, 1\}^L$ is a label vector (or $Y_i \subseteq \{1, 2, \dots, L\}$)
Example: $y_i = [0, 0, 1, 0, 0, 1, 1]$ (or $Y_i = \{3, 6, 7\}$)
Each training example can belong to multiple categories
Goal: Given a test sample $x$, predict the correct labels

Illustration

Multiclass: each row of the label matrix has exactly one "1"
Multilabel: each row of the label matrix can have multiple ones

Reduction to binary classification

Many algorithms exist for binary classification
Idea: transform multi-class or multi-label problems into multiple binary classification problems
Two approaches:
One versus All (OVA)
One versus One (OVO)

One Versus All (OVA)

Multi-class/multi-label problems with $L$ categories
Build $L$ different binary classifiers
For the $t$-th classifier:
Positive samples: all the points in class $t$ ($\{x_i : t \in y_i\}$)
Negative samples: all the points not in class $t$ ($\{x_i : t \notin y_i\}$)
$f_t(x)$: the decision value for the $t$-th classifier (larger $f_t$ $\Rightarrow$ higher probability that $x$ is in class $t$)
Prediction: $f(x) = \arg\max_t f_t(x)$
Example: using SVM to train each binary classifier.

One Versus One (OVO)

Multi-class/multi-label problems with $L$ categories
Build $L(L-1)$ different binary classifiers
For the $(s, t)$-th classifier:
Positive samples: all the points in class $s$ ($\{x_i : s \in y_i\}$)
Negative samples: all the points in class $t$ ($\{x_i : t \in y_i\}$)
$f_{s,t}(x)$: the decision value for this classifier (larger $f_{s,t}(x)$ $\Rightarrow$ label $s$ has higher probability than label $t$)
$f_{t,s}(x) = -f_{s,t}(x)$
Prediction: $f(x) = \arg\max_s \Big( \sum_t f_{s,t}(x) \Big)$
Example: using SVM to train each binary classifier.

OVA vs OVO

Prediction accuracy: depends on the dataset
Computational time:
OVA needs to train $L$ classifiers
OVO needs to train $L(L-1)/2$ classifiers
Is OVA always faster than OVO? No; it depends on the time complexity of the binary classifier:
If the binary classifier requires $O(n)$ time for $n$ samples: OVA and OVO have similar time complexity
If the binary classifier requires $O(n^{1.xx})$ time: OVO is faster than OVA
LIBSVM (kernel SVM solver): OVO
LIBLINEAR (linear SVM solver): OVA
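scikit-learn provides generic wrappers for both reductions; a short sketch with a placeholder 3-class dataset (all names and parameters are illustrative):

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC

# Placeholder 3-class data just to make the sketch runnable.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = (X[:, 0] > 0).astype(int) + (X[:, 1] > 0).astype(int)   # labels in {0, 1, 2}

ova = OneVsRestClassifier(LinearSVC(C=1.0)).fit(X, y)  # trains L binary classifiers
ovo = OneVsOneClassifier(LinearSVC(C=1.0)).fit(X, y)   # trains L(L-1)/2 binary classifiers

print(ova.predict(X[:5]), ovo.predict(X[:5]))
```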

Another approach for multi-class classification

OVA and OVO: decompose the problem by labels
But good binary classifiers may not imply good multi-class prediction
Design a multi-class loss function and solve a single optimization problem
Minimize the regularized training error:
$$\min_{w_1, \dots, w_L} \;\; \sum_{i=1}^n \text{loss}(x_i, y_i) + \lambda \sum_{j=1}^{L} w_j^T w_j$$

Loss functions for multi-class classification

Ranking-based approaches: directly minimize the ranking loss:
for multiclass classification, the score of $y_i$ should be larger than the scores of the other labels
Soft-max loss: measures the probability of predicting the correct class

Main idea

For simplicity, we assume a linear model

Model parameters: $w_1, \dots, w_L$
For each data point $x$, compute the decision value for each label:
$$w_1^T x, \; w_2^T x, \; \dots, \; w_L^T x$$
Prediction: $y = \arg\max_t w_t^T x$
For training data $x_i$, $y_i$ is the true label, so we want
$$y_i \approx \arg\max_t w_t^T x_i \quad \forall i$$

Softmax

The predicted score for each class:

$$w_1^T x_i, \; w_2^T x_i, \; \cdots$$

Loss for the $i$-th data point is defined by
$$-\log\left( \frac{e^{w_{y_i}^T x_i}}{\sum_j e^{w_j^T x_i}} \right)$$
(the probability of choosing the correct label)

Solve a single optimization problem:
$$\min_{w_1, \dots, w_L} \;\; \sum_{i=1}^n -\log\left( \frac{e^{w_{y_i}^T x_i}}{\sum_j e^{w_j^T x_i}} \right) + \lambda \sum_j w_j^T w_j$$
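A numerically stable NumPy sketch of this objective (subtracting the row-wise max before exponentiating is an implementation detail, not from the slides; the names are illustrative):

```python
import numpy as np

def softmax_objective(W, X, y, lam=0.0):
    # W: (L, d), rows are w_1,...,w_L;  X: (n, d);  y: (n,) labels in {0,...,L-1}
    scores = X @ W.T                                     # decision values w_t^T x_i, shape (n, L)
    scores = scores - scores.max(axis=1, keepdims=True)  # stability shift; loss is unchanged
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(y)), y].sum()         # sum_i -log P(y_i | x_i)
    return nll + lam * np.sum(W * W)                     # + lambda * sum_j w_j^T w_j
```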

Weston-Watkins Formulation

Proposed in Weston and Watkins, "Multi-class support vector machines". In ESANN, 1999.

$$\min_{\{w_t\}, \{\xi_i^t\}} \;\; \frac{1}{2} \sum_{t=1}^L \|w_t\|^2 + C \sum_{i=1}^n \sum_{t=1}^L \xi_i^t \quad \text{s.t.} \;\; w_{y_i}^T x_i - w_t^T x_i \ge 1 - \xi_i^t, \;\; \xi_i^t \ge 0 \quad \forall t \ne y_i, \; \forall i = 1, \dots, n$$

If point $i$ is in class $y_i$, then for any other label ($t \ne y_i$) we want
$$w_{y_i}^T x_i - w_t^T x_i \ge 1,$$
or we pay a penalty $\xi_i^t$.
Prediction: $f(x) = \arg\max_t w_t^T x$
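Eliminating the slack variables, each pair $(i, t)$ with $t \ne y_i$ contributes $\max\big(0,\, 1 - (w_{y_i}^T x_i - w_t^T x_i)\big)$ to the objective; a small NumPy sketch of this equivalent unconstrained form (names are illustrative):

```python
import numpy as np

def weston_watkins_objective(W, X, y, C=1.0):
    # W: (L, d);  X: (n, d);  y: (n,) labels in {0,...,L-1}
    scores = X @ W.T                                     # decision values w_t^T x_i, shape (n, L)
    correct = scores[np.arange(len(y)), y][:, None]      # w_{y_i}^T x_i
    slacks = np.maximum(0.0, 1.0 - (correct - scores))   # xi_i^t = max(0, 1 - (w_{y_i} - w_t)^T x_i)
    slacks[np.arange(len(y)), y] = 0.0                   # no penalty for t == y_i
    return 0.5 * np.sum(W * W) + C * slacks.sum()
```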

Crammer-Singer Formulation

Proposed in Crammer and Singer, "On the algorithmic implementation of multiclass kernel-based vector machines". JMLR, 2001.

$$\min_{\{w_t\}, \{\xi_i\}} \;\; \frac{1}{2} \sum_{t=1}^L \|w_t\|^2 + C \sum_{i=1}^n \xi_i \quad \text{s.t.} \;\; w_{y_i}^T x_i - w_t^T x_i \ge 1 - \xi_i \quad \forall t \ne y_i, \; \forall i = 1, \dots, n, \quad \xi_i \ge 0 \;\; \forall i = 1, \dots, n$$

If point $i$ is in class $y_i$, then for any other label ($t \ne y_i$) we want
$$w_{y_i}^T x_i - w_t^T x_i \ge 1.$$
For each point $i$, we only pay the largest penalty.
Prediction: $f(x) = \arg\max_t w_t^T x$
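In scikit-learn, LinearSVC has exposed this formulation through the multi_class="crammer_singer" option (check the docs for your version); a short illustrative sketch with placeholder data:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Placeholder 3-class data just to make the sketch runnable.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = (X[:, 0] > 0).astype(int) + (X[:, 1] > 0).astype(int)   # labels in {0, 1, 2}

clf = LinearSVC(C=1.0, multi_class="crammer_singer").fit(X, y)
print(clf.predict(X[:5]))    # prediction: arg max_t w_t^T x
```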

Conclusions

SVM, Kernel SVM, Kernel methods

Questions?