A Kernel Path Algorithm for Support Vector Machines

A Kernel Path Algorithm for Support Vector Machines Gang Wang [email protected] Dit-Yan Yeung [email protected] Frederick H. Lochovsky [email protected] Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China Abstract allows the inner product between two points in the fea- The choice of the kernel function which deter- ture space to be computed without having to know the mines the mapping between the input space explicit mapping from the input space to the feature and the feature space is of crucial importance space. Rather, it is simply a function of two points in to kernel methods. The past few years have the input space. In general, the input space does not seen many efforts in learning either the ker- have to be a vector space and hence structured, non- nel function or the kernel matrix. In this vectorial data can also be handled in the same way. paper, we address this model selection issue For a kernel method to perform well, the kernel func- by learning the hyperparameter of the ker- tion often plays a very crucial role. Rather than choos- nel function for a support vector machine ing the kernel function and setting its hyperparame- (SVM). We trace the solution path with re- ters manually, many attempts have been made over spect to the kernel hyperparameter without the past few years to automate this process, at least having to train the model multiple times. partially. Ong et al. (2005) shows that the kernel Given a kernel hyperparameter value and the function is a linear combination of a finite number of optimal solution obtained for that value, we prespecified hyperkernel evaluations, and introduces a find that the solutions of the neighborhood method to learn the kernel function directly in an in- hyperparameters can be calculated exactly. ductive setting. Some approaches have been proposed However, the solution path does not exhibit (Cristianini et al., 2002; Bach et al., 2004; Lanckriet piecewise linearity and extends nonlinearly. et al., 2004; Sonnenburg et al., 2006; Zhang et al., As a result, the breakpoints cannot be com- 2006) to seek a kernel matrix directly or learn a conic puted in advance. We propose a method to combination of kernel matrices from data. approximate the breakpoints. Our method is both efficient and general in the sense that Our paper adopts the kernel function learning ap- it can be applied to many kernel functions in proach. However, unlike the method of hyperkernels, common use. we seek to learn the optimal hyperparameter value for a prespecified kernel function. The traditional approach to this model selection problem is to ap- m 1. Introduction ply methods like -fold cross validation to determine the best choice among a number of prespecified can- Kernel methods (Müller et al., 2001; Schölkopf & didate hyperparameter values. Extensive exploration Smola, 2002) have demonstrated great success in solv- such as performing line search for one hyperparameter ing many machine learning and pattern recognition or grid search for two hyperparameters is usually ap- problems. These methods implicitly map data points plied. However, this requires training the model mul- from the input space to some feature space where even tiple times with different hyperparameter values and relatively simple algorithms such as linear methods can hence is computationally prohibitive especially when deliver very impressive performance. The implicit fea- the number of candidate values is large. Keerthi et al. ture mapping is determined by a kernel function, which (2006) proposed a hyperparameter tuning approach th based on minimizing a smooth performance validation Appearing in Proceedings of the 24 International Confer- function. The direction to update the hyperparame- ence on Machine Learning, Corvallis, OR, 2007. Copyright 2007 by the author(s)/owner(s). ters is the (negative) gradient of the validation func- A Kernel Path Algorithm for Support Vector Machines tion with respect to the hyperparameters. This ap- over, our algorithm is general in the sense that it can proach also requires training the model and computing be applied to many kernel functions in common use. the gradient multiple times. Some approaches have been proposed to overcome 2. Problem Formulation these problems. A promising recent approach is based In a binary classification problem, we have a set of on solution path algorithms, which can trace the en- n d training pairs {(xi,yi)} where xi ∈X ⊆R is the tire solution path as a function of the hyperparame- i=1 input data, X is the input space, and yi ∈{−1, +1} is ter without having to train the model multiple times. the class label. The goal is to learn a decision function A solution path algorithm is also called a regulariza- f(x) that can classify the unseen data as accurately tion path algorithm if the hyperparameter involved in as possible. Its associated classifier is sign[f(x)]. Us- path tracing is the regularization parameter in the op- ing the hinge loss, the primal optimization problem for timization problem. Efron et al. (2004) proposed an SVC can be formulated based on the standard regu- algorithm called the least angle regression (LARS). It larization framework as follows: can be applied to trace the regularization path for lin- n λ 2 ear least square regression problem regularized with min Rprimal = i=1 ξi + fK (1) f∈H 2 the L1 norm. The path is piecewise linear and hence . .yf ≥ − ξ it is efficient to explore the entire path just by mon- s t i (xi) 1 i (2) itoring the breakpoints between the linear segments ξi ≥ 0. (3) only. Zhu et al. (2003) proposed an algorithm to com- pute the entire regularization path for the L1-norm Here and below, i =1, ..., n. ·K denotes the norm in support vector classification (SVC) and Hastie et al. the reproducing kernel Hilbert space (RKHS) H corre- (2004) proposed one for the standard L2-norm SVC. sponding to a positive definite kernel function K. λ is They are again based on the property that the paths the regularization parameter which gives the balance are piecewise linear. More generally, Rosset and Zhu between the two opposing components of the objec- (2003) showed that any model with an L1 regulariza- tive function. A kernel function, i.e., Kγ (xi, xj), is a tion and a quadratic, piecewise quadratic, piecewise bivariate function with its two independent variables linear, or linear loss function has a piecewise linear defined in the input space. Here we explicitly show regularization path and hence the entire path can be the hyperparameter γ of the kernel function in the computed efficiently. Bach et al. (2005) explored a subscript. Different values of γ embed the data into nonlinear regularization path for multiple kernel learn- different feature spaces. In SVC, both λ and γ are ing regularized with a block L1-norm. Rosset (2004) hyperparameters in the model. proposed a general path following algorithm to approx- Using Wahba’s representer theorem and Wolfe’s dual- imate the nonlinear regularization path. Besides the ity theorem, we can derive the dual form of the opti- regularization parameter, we showed in our previous mization problem as: work (Wang et al., 2006) that this approach can also n be applied to explore the solution paths for some other max Rdual (β,β0)=λ i=1 βi − (4) β,β model hyperparameters. 0 1 n i,j=1 βiβjyiyjKγ (xi, xj) In this paper, we propose a novel method that traces 2 . n y β the solution path with respect to the kernel hyperpa- s t i=1 i i = 0 (5) rameter in SVC. We refer to this solution path as a 0 ≤ βi ≤ 1, (6) kernel path. Given a kernel hyperparameter value and where the dual variables (Lagrange multipliers) β are the optimal solution obtained for that value, we find introduced for constraint (2). The decision function is that the solutions of the neighborhood hyperparame- given by ters can be calculated exactly. Since the kernel hyper- n parameter is embedded into each entry of the kernel f 1 β y K , β . matrix, the path is piecewise smooth but not piecewise (x)=λ i i γ (x xi)+ 0 (7) i=1 linear. The implication is that the next breakpoint cannot be computed in advance before reaching it like f(x) is expressed as an expansion in terms of only a what other solution path algorithms do. We propose subset of data points, called support vectors (SV), for a method to approximate the breakpoints. Unlike the which βi is nonzero. path following algorithm of Rosset (2004) for nonlin- From the KKT optimality conditions, we have three ear regularization paths, our approach is more efficient cases to consider depending on the value of yif(xi), so that the kernel path can be traced efficiently. More- which in turn determines how much the loss ξi is: A Kernel Path Algorithm for Support Vector Machines • If yif(xi) > 1, then ξi =0andβi =0; hold, then the corresponding linear system becomes a L(β (μ + ),μ+ )=0. • y f ξ β ∈ , E If i (xi) = 1, then i =0and i [0 1]; a Expanding L(βE (μ + ),μ+ ) via first-order Taylor a • If yif(xi) < 1, then ξi > 0andβi =1. series approximation around βE (μ), we have βa μ ,μ βa μ ,μ These three cases refer to points lying outside, at and L( E ( + ) + ) L( E ( ) + ) (13) a inside the margins, respectively. We can see that βi ∂L(βE (μ),μ+ ) a a + a [βE (μ + ) − βE (μ)] . can take non-extreme values other than 0 and 1 at ∂βE the optimal solution βˆ only if the value of yif(xi)is a equal to 1.

Load more