
A Kernel Path Algorithm for Support Vector Machines

Gang Wang [email protected] Dit-Yan Yeung [email protected] Frederick H. Lochovsky [email protected] Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China

Abstract

The choice of the kernel function, which determines the mapping between the input space and the feature space, is of crucial importance to kernel methods. The past few years have seen many efforts in learning either the kernel function or the kernel matrix. In this paper, we address this model selection issue by learning the hyperparameter of the kernel function for a support vector machine (SVM). We trace the solution path with respect to the kernel hyperparameter without having to train the model multiple times. Given a kernel hyperparameter value and the optimal solution obtained for that value, we find that the solutions of the neighborhood hyperparameters can be calculated exactly. However, the solution path does not exhibit piecewise linearity and extends nonlinearly. As a result, the breakpoints cannot be computed in advance. We propose a method to approximate the breakpoints. Our method is both efficient and general in the sense that it can be applied to many kernel functions in common use.

1. Introduction

Kernel methods (Müller et al., 2001; Schölkopf & Smola, 2002) have demonstrated great success in solving many machine learning problems. These methods implicitly map data points from the input space to some feature space where even relatively simple algorithms such as linear methods can deliver very impressive performance. The implicit feature mapping is determined by a kernel function, which allows the inner product between two points in the feature space to be computed without having to know the explicit mapping from the input space to the feature space. Rather, it is simply a function of two points in the input space. In general, the input space does not have to be a vector space, and hence structured, non-vectorial data can also be handled in the same way.

For a kernel method to perform well, the kernel function often plays a very crucial role. Rather than choosing the kernel function and setting its hyperparameters manually, many attempts have been made over the past few years to automate this process, at least partially. Ong et al. (2005) show that the kernel function is a linear combination of a finite number of prespecified hyperkernel evaluations, and introduce a method to learn the kernel function directly in an inductive setting. Some approaches have been proposed (Cristianini et al., 2002; Bach et al., 2004; Lanckriet et al., 2004; Sonnenburg et al., 2006; Zhang et al., 2006) to seek a kernel matrix directly or to learn a conic combination of kernel matrices from data.

Our paper adopts the kernel function learning approach. However, unlike the method of hyperkernels, we seek to learn the optimal hyperparameter value for a prespecified kernel function. The traditional approach to this model selection problem is to apply methods like m-fold cross validation to determine the best choice among a number of prespecified candidate hyperparameter values. Extensive exploration, such as performing line search for one hyperparameter or grid search for two hyperparameters, is usually applied. However, this requires training the model multiple times with different hyperparameter values and hence is computationally prohibitive, especially when the number of candidate values is large.

Keerthi et al. (2006) proposed a hyperparameter tuning approach based on minimizing a smooth validation performance function. The direction in which to update the hyperparameters is the (negative) gradient of the validation function with respect to the hyperparameters. This approach also requires training the model and computing the gradient multiple times.

Some approaches have been proposed to overcome these problems. A promising recent approach is based on solution path algorithms, which can trace the entire solution path as a function of the hyperparameter without having to train the model multiple times. A solution path algorithm is also called a regularization path algorithm if the hyperparameter involved in path tracing is the regularization parameter of the optimization problem. Efron et al. (2004) proposed an algorithm called least angle regression (LARS). It can be applied to trace the regularization path for the linear least squares regression problem regularized with the L1 norm. The path is piecewise linear and hence it is efficient to explore the entire path just by monitoring the breakpoints between the linear segments. Zhu et al. (2003) proposed an algorithm to compute the entire regularization path for L1-norm support vector classification (SVC), and Hastie et al. (2004) proposed one for the standard L2-norm SVC. They are again based on the property that the paths are piecewise linear. More generally, Rosset and Zhu (2003) showed that any model with an L1 regularization and a quadratic, piecewise quadratic, piecewise linear, or linear loss function has a piecewise linear regularization path, and hence the entire path can be computed efficiently. Bach et al. (2005) explored a nonlinear regularization path for multiple kernel learning regularized with a block L1-norm. Rosset (2004) proposed a general path following algorithm to approximate nonlinear regularization paths. Besides the regularization parameter, we showed in our previous work (Wang et al., 2006) that this approach can also be applied to explore the solution paths for some other model hyperparameters.

In this paper, we propose a novel method that traces the solution path with respect to the kernel hyperparameter in SVC. We refer to this solution path as a kernel path. Given a kernel hyperparameter value and the optimal solution obtained for that value, we find that the solutions of the neighborhood hyperparameters can be calculated exactly. Since the kernel hyperparameter is embedded into each entry of the kernel matrix, the path is piecewise smooth but not piecewise linear. The implication is that the next breakpoint cannot be computed in advance before reaching it, as other solution path algorithms do. We propose a method to approximate the breakpoints. Unlike the path following algorithm of Rosset (2004) for nonlinear regularization paths, our approach is more efficient so that the kernel path can be traced efficiently. Moreover, our algorithm is general in the sense that it can be applied to many kernel functions in common use.

2. Problem Formulation

In a binary classification problem, we have a set of n training pairs {(x_i, y_i)}_{i=1}^{n}, where x_i ∈ X ⊆ R^d is the input data, X is the input space, and y_i ∈ {−1, +1} is the class label. The goal is to learn a decision function f(x) that can classify the unseen data as accurately as possible. Its associated classifier is sign[f(x)]. Using the hinge loss, the primal optimization problem for SVC can be formulated based on the standard regularization framework as follows:

\[
\min_{f \in \mathcal{H}} \; R_{\mathrm{primal}} = \sum_{i=1}^{n} \xi_i + \frac{\lambda}{2} \|f\|_K^2 \qquad (1)
\]
\[
\text{s.t.} \quad y_i f(\mathbf{x}_i) \ge 1 - \xi_i \qquad (2)
\]
\[
\xi_i \ge 0. \qquad (3)
\]

Here and below, i = 1, ..., n. ‖·‖_K denotes the norm in the reproducing kernel Hilbert space (RKHS) H corresponding to a positive definite kernel function K. λ is the regularization parameter which gives the balance between the two opposing components of the objective function. A kernel function, i.e., K_γ(x_i, x_j), is a bivariate function with its two independent variables defined in the input space. Here we explicitly show the hyperparameter γ of the kernel function in the subscript. Different values of γ embed the data into different feature spaces. In SVC, both λ and γ are hyperparameters of the model.

Using Wahba's representer theorem and Wolfe's duality theorem, we can derive the dual form of the optimization problem as:

\[
\max_{\beta, \beta_0} \; R_{\mathrm{dual}}(\beta, \beta_0) = \lambda \sum_{i=1}^{n} \beta_i - \frac{1}{2} \sum_{i,j=1}^{n} \beta_i \beta_j y_i y_j K_\gamma(\mathbf{x}_i, \mathbf{x}_j) \qquad (4)
\]
\[
\text{s.t.} \quad \sum_{i=1}^{n} y_i \beta_i = 0 \qquad (5)
\]
\[
0 \le \beta_i \le 1, \qquad (6)
\]

where the dual variables (Lagrange multipliers) β are introduced for constraint (2). The decision function is given by

\[
f(\mathbf{x}) = \frac{1}{\lambda} \Big( \sum_{i=1}^{n} \beta_i y_i K_\gamma(\mathbf{x}, \mathbf{x}_i) + \beta_0 \Big). \qquad (7)
\]

f(x) is expressed as an expansion in terms of only a subset of the data points, called support vectors (SVs), for which β_i is nonzero.
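For concreteness, the decision function (7) can be evaluated directly from the dual variables. Below is a minimal NumPy sketch, assuming the Gaussian RBF kernel K_γ(x, x′) = exp(−‖x − x′‖²/γ) that is also used later in Section 4; the function names are illustrative only.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian RBF kernel matrix K_gamma(a, b) = exp(-||a - b||^2 / gamma)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    return np.exp(-sq / gamma)

def decision_function(X_new, X, y, beta, beta0, lam, gamma):
    """Evaluate f(x) = (1/lambda) * (sum_i beta_i y_i K_gamma(x, x_i) + beta_0), eq. (7)."""
    K = rbf_kernel(np.atleast_2d(X_new), X, gamma)        # (q, n) cross-kernel matrix
    return (K @ (beta * y) + beta0) / lam                 # decision values for the q query points
```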

From the KKT optimality conditions, we have three cases to consider depending on the value of y_i f(x_i), which in turn determines how large the loss ξ_i is:

• If y_i f(x_i) > 1, then ξ_i = 0 and β_i = 0;
• If y_i f(x_i) = 1, then ξ_i = 0 and β_i ∈ [0, 1];
• If y_i f(x_i) < 1, then ξ_i > 0 and β_i = 1.

These three cases refer to points lying outside, at, and inside the margins, respectively. We can see that β_i can take non-extreme values other than 0 and 1 at the optimal solution β̂ only if the value of y_i f(x_i) is equal to 1. β_0 is computed based on the points at the margins.

The optimal solution may be regarded as a vector function of some hyperparameter, denoted as μ:

\[
(\hat{\beta}(\mu), \hat{\beta}_0(\mu)) = \arg\max_{(\beta, \beta_0)} R_{\mathrm{dual}}(\beta, \beta_0; \mu). \qquad (8)
\]

Thus, the optimal solution varies as μ changes. For notational simplicity, we drop the hat notation in the rest of the paper. For every value of μ, we can partition the set of indices of the n data points into three subsets, E, L and R, which are referred to as the elbow, left of the elbow, and right of the elbow, respectively, to be consistent with the convention introduced in Hastie et al. (2004):

\[
\mathcal{E} = \Big\{ i : y_i \Big( \sum_{j=1}^{n} \beta_j y_j K_\gamma(\mathbf{x}_i, \mathbf{x}_j) + \beta_0 \Big) = \lambda, \; 0 \le \beta_i \le 1 \Big\} \qquad (9)
\]
\[
\mathcal{L} = \Big\{ i : y_i \Big( \sum_{j=1}^{n} \beta_j y_j K_\gamma(\mathbf{x}_i, \mathbf{x}_j) + \beta_0 \Big) < \lambda, \; \beta_i = 1 \Big\} \qquad (10)
\]
\[
\mathcal{R} = \Big\{ i : y_i \Big( \sum_{j=1}^{n} \beta_j y_j K_\gamma(\mathbf{x}_i, \mathbf{x}_j) + \beta_0 \Big) > \lambda, \; \beta_i = 0 \Big\} \qquad (11)
\]

Therefore, the direction (∂β(μ)/∂μ, ∂β_0(μ)/∂μ) in which the solution moves as μ changes should maintain the conditions (9)–(11) above.

Suppose E contains m indices which are represented as an m-tuple (E(1), ..., E(m)) such that E(i) < E(j) for i < j. Let β^a_E(μ) = (β_0(μ), β_{E(1)}(μ), ..., β_{E(m)}(μ))^T denote the augmented solution vector associated with E. The elbow conditions in (9) can then be written as the linear system

\[
L(\beta_{\mathcal{E}}^{a}(\mu), \mu) = \mathbf{0}, \qquad (12)
\]

whose ith component is y_{E(i)} ( Σ_{j=1}^{n} β_j y_j K_γ(x_{E(i)}, x_j) + β_0 ) − λ. Suppose we change μ by a small amount ε. If the conditions (9)–(11) continue to hold, then the corresponding linear system becomes L(β^a_E(μ + ε), μ + ε) = 0.

Expanding L(β^a_E(μ + ε), μ + ε) via a first-order Taylor series approximation around β^a_E(μ), we have

\[
L(\beta_{\mathcal{E}}^{a}(\mu+\varepsilon), \mu+\varepsilon) \approx L(\beta_{\mathcal{E}}^{a}(\mu), \mu+\varepsilon) + \frac{\partial L(\beta_{\mathcal{E}}^{a}(\mu), \mu+\varepsilon)}{\partial \beta_{\mathcal{E}}^{a}} \big[ \beta_{\mathcal{E}}^{a}(\mu+\varepsilon) - \beta_{\mathcal{E}}^{a}(\mu) \big]. \qquad (13)
\]

The gradient of L with respect to β^a_E is a matrix of size m × (m + 1):

\[
\frac{\partial L}{\partial \beta_{\mathcal{E}}^{a}} = \big[ \mathbf{y}_{\mathcal{E}} \;\; \tilde{K}_\gamma \big], \qquad (14)
\]

where K̃_γ = [ y_{E(i)} y_{E(j)} K_γ(x_{E(i)}, x_{E(j)}) ]_{i,j=1}^{m}. It is interesting to note that the gradient ∂L/∂β^a_E does not contain the parameter β^a_E any more. Hence all terms after the first-order term of the Taylor series are in fact equal to zero, implying that the ≈ in (13) should be replaced by =. From constraint (5), we also have

\[
\mathbf{y}_{\mathcal{E}}^{T} \big( \beta_{\mathcal{E}}(\mu+\varepsilon) - \beta_{\mathcal{E}}(\mu) \big) = 0. \qquad (15)
\]

Combining equations (13) and (15), we know that the next solution β^a_E(μ + ε) can be updated as

\[
\beta_{\mathcal{E}}^{a}(\mu+\varepsilon) = \beta_{\mathcal{E}}^{a}(\mu) + \begin{bmatrix} 0 \quad \mathbf{y}_{\mathcal{E}}^{T} \\ \partial L(\beta_{\mathcal{E}}^{a}(\mu), \mu+\varepsilon)/\partial \beta_{\mathcal{E}}^{a} \end{bmatrix}^{-1} \begin{bmatrix} 0 \\ -L(\beta_{\mathcal{E}}^{a}(\mu), \mu+\varepsilon) \end{bmatrix}. \qquad (16)
\]

Hence, given a hyperparameter μ and its corresponding optimal solution (β(μ), β_0(μ)), the solutions at its neighborhood hyperparameters can be computed exactly as long as the three point sets remain unchanged. However, some events may occur when we change the value of μ to a larger extent. An event is said to occur when some of the point sets change. We categorize these events as follows:

• A point i ∈ E leaves the elbow, i.e., β_{E(i)}(μ) reaches 0 or 1 so that the condition (9) ceases to hold;
• A new point i joins E, i.e., the condition (10) for a variable β_i with i ∈ L or the condition (11) for a variable β_i with i ∈ R ceases to hold.
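Monitoring these events amounts to recomputing the partition (9)–(11) for a candidate solution. A minimal sketch, assuming the `rbf_kernel` helper above and a small numerical tolerance `tol` (an implementation detail, not part of the formulation):

```python
import numpy as np

def partition_sets(X, y, beta, beta0, lam, gamma, tol=1e-8):
    """Split training indices into the elbow E, left L and right R sets of (9)-(11)."""
    K = rbf_kernel(X, X, gamma)
    yf = y * ((K @ (beta * y) + beta0) / lam)      # y_i f(x_i); the elbow is yf == 1
    E = np.where(np.abs(yf - 1.0) <= tol)[0]       # y_i f(x_i) = 1 and 0 <= beta_i <= 1
    L = np.where(yf < 1.0 - tol)[0]                # y_i f(x_i) < 1, beta_i = 1
    R = np.where(yf > 1.0 + tol)[0]                # y_i f(x_i) > 1, beta_i = 0
    return E, L, R
```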

2.1. Regularization Path Algorithm

Our goal is to generate the solution path for a range of hyperparameter values by repeatedly calculating the next optimal solution based on the current one. Based on the updating formula in equation (16), we can easily derive the path following algorithm with respect to the regularization parameter λ. Replacing μ with λ, we have

\[
\beta_{\mathcal{E}}^{a}(\lambda+\varepsilon) = \beta_{\mathcal{E}}^{a}(\lambda) + \varepsilon \begin{bmatrix} 0 & \mathbf{y}_{\mathcal{E}}^{T} \\ \mathbf{y}_{\mathcal{E}} & \tilde{K}_\gamma \end{bmatrix}^{-1} \begin{bmatrix} 0 \\ \mathbf{1} \end{bmatrix}, \qquad (17)
\]

where the γ value is fixed. Note that equation (17) for the solution updating is equivalent to the formula in Hastie et al. (2004). Substituting (17) into (7), we can calculate the next breakpoint at which the conditions (9)–(11) will no longer hold if λ is changed further. As a result, the regularization path can be explored.

3. Kernel Path Algorithm

Replacing μ by γ in equation (16), the kernel path algorithm can be derived in a similar manner. We consider the period between the lth event (with γ = γ^l) and the (l + 1)th event (with γ = γ^{l+1}). The sets E, L and R remain unchanged during this period. Thus we trace the solution path of (β_E(γ), β_0(γ)) as γ changes.

Theorem 1  Suppose the optimal solution is (β^l, β^l_0) when γ = γ^l. Then for any γ in γ^{l+1} < γ < γ^l, we have the following results:

• For the points i ∈ L ∪ R, β_i = β^l_i is fixed at 0 or 1, which is independent of γ;
• The solution to (β_E, β_0) is given by

\[
\begin{bmatrix} \beta_0 \\ \beta_{\mathcal{E}} \end{bmatrix} = \begin{bmatrix} \beta_0^{l} \\ \beta_{\mathcal{E}}^{l} \end{bmatrix} + \begin{bmatrix} 0 & \mathbf{y}_{\mathcal{E}}^{T} \\ \mathbf{y}_{\mathcal{E}} & \tilde{K}_\gamma \end{bmatrix}^{-1} \begin{bmatrix} 0 \\ \mathbf{b}_\gamma \end{bmatrix}, \qquad (18)
\]

where

\[
\tilde{K}_\gamma = \Big[ y_{\mathcal{E}(i)} y_{\mathcal{E}(j)} K_\gamma(\mathbf{x}_{\mathcal{E}(i)}, \mathbf{x}_{\mathcal{E}(j)}) \Big]_{i,j=1}^{m}, \qquad (19)
\]
\[
\mathbf{b}_\gamma = \Big[ -y_{\mathcal{E}(i)} \Big( \sum_{j=1}^{n} \beta_j^{l} y_j K_\gamma(\mathbf{x}_{\mathcal{E}(i)}, \mathbf{x}_j) + \beta_0^{l} \Big) + \lambda \Big]_{i=1}^{m}. \qquad (20)
\]

As such, we can update the solution to (β_E, β_0) exactly while the others remain unchanged. The value of γ can either increase or decrease. As γ changes, the algorithm monitors the occurrence of any of the following events:

• One of the β_{E(i)} for i = 1, ..., m reaches 0 or 1.
• A point i ∉ E hits the elbow, i.e., y_i f(x_i) = 1.

By monitoring the occurrence of these events, we compute the largest γ < γ^l for which an event occurs. This value is a breakpoint and is denoted by γ^{l+1}. We then update the point sets and continue until the algorithm terminates.

In the previous works (Zhu et al., 2003; Hastie et al., 2004; Wang et al., 2006), the solution path is piecewise linear with respect to some hyperparameter, and the breakpoint at which the next event occurs can be calculated in advance before actually reaching it. However, the value of the kernel hyperparameter is implicitly embedded into the pairwise distances between points. As a result, we need to specify a γ value in advance to compute the next solution and then check whether the next event has occurred or not. Suppose we are given the optimal solution at γ = γ^l. We propose here an efficient algorithm for estimating the next breakpoint, i.e., γ^{l+1}, at which the next event occurs.

Table 1. Kernel path algorithm.

Input:  β̂, β̂_0, γ_0  – initial solution and its γ value;
        θ, ε, γ_min  – decay rate, error tolerance, γ limit

1.  t = 0; β^0 = β̂, β^0_0 = β̂_0
2.  while γ > γ_min
3.      r = θ;
4.      while r < 1 − ε
5.          γ = r γ^t;
6.          use (18) to compute (β(γ), β_0(γ))
7.          if (β(γ), β_0(γ)) is the valid solution
8.              β^{t+1} = β(γ); β^{t+1}_0 = β_0(γ); γ^{t+1} = γ; t = t + 1
9.          else r = r^{1/2}
10.         endif
11.     end-while
12.     update the point sets E, L, R;
13. end-while

Output:  a sequence of solutions (β(γ), β_0(γ)), γ_min < γ < γ_0

Table 1 shows the pseudocode of our proposed kernel path algorithm. The user has to specify a decay rate θ ∈ (0, 1). At each iteration, γ is decreased through multiplying it by θ. If the next event has not occurred, we continue to multiply γ by θ. Otherwise the decay rate r is set to r^{1/2}. The above steps are repeated until the decay rate becomes no less than 1 − ε, where ε is some error tolerance specified in advance by the user. Hence, we can estimate the breakpoint as a value γ with γ^{l+1} ≤ γ ≤ γ^{l+1}/(1 − ε).

Note that the above describes the kernel path algorithm only in the decreasing direction. In general, γ can either increase or decrease; the only difference for the increasing direction is to set θ greater than 1. An SVM solver is needed in the beginning to initialize the optimal solution for some γ value.

We assume that, on average, the ratio of the γ values (γ^{t+1}/γ^t) at two consecutive events is π. Thus, the algorithm needs log_θ π + log_2(log_{1−ε} θ) iterations from γ^t to γ^{t+1}. The choice of θ is a tradeoff between log_θ π and log_2(log_{1−ε} θ), and the choice of ε represents a tradeoff between computational complexity and accuracy. Setting the error tolerance ε to a large value may lead to the algorithm getting stuck, and here we set it to 10^{−6}. It is not necessary to set ε to be very small, say 10^{−9}, which would require a great deal of unnecessary computation. Figure 1 shows the number of iterations needed as a function of the decay rate θ. Three average ratios, π = 0.85, 0.9 and 0.95, are considered. If θ is chosen not to be very close to 1, then the number of iterations does not vary much. We set θ = 0.95 and ε = 10^{−6} in our experiments, and the algorithm always takes less than 20 iterations to reach the next breakpoint.

Figure 1. Number of iterations log_θ π + log_2(log_{1−ε} θ) vs. decay rate θ. Three different π values (0.85, 0.9, 0.95) are considered and the error tolerance ε is set to 10^{−6}.

4. Discussion

In step 7, the algorithm checks whether the new solution is valid or not. A naive way is to scan through the entire training set to validate the move. However, this is too slow, involving a great deal of unnecessary computation. A feasible alternative for larger datasets is to keep track of only a small set of marginal points, i.e., M = {i ∈ E : β_i < ξ_0 or β_i > 1 − ξ_1} ∪ {i ∈ L : y_i f(x_i) > 1 − ξ_2} ∪ {i ∈ R : y_i f(x_i) < 1 + ξ_3} for small positive numbers ξ_0, ξ_1, ξ_2 and ξ_3, and discard all other points. A marginal point is one that is likely to leave the set to which it currently belongs and join another set. Thus, only a small number of entries of the kernel matrix need to be re-calculated. Keeping track of only these points between two consecutive events makes the algorithm significantly more efficient. Every time one or several events occur, the set of marginal points is updated accordingly.

Note that the kernel function used in the algorithm can be very general. For example, we may use the polynomial kernel K_{c,d}(x_i, x_j) = (⟨x_i, x_j⟩ + c)^d or the Gaussian RBF kernel K_γ(x_i, x_j) = exp(−‖x_i − x_j‖²/γ). If the kernel function contains multiple hyperparameters, the kernel solution path may still be traced in a multivariate space by updating the solution for one hyperparameter while holding the others fixed at each iteration.

Since γ decreases at each iteration, the kernel matrix also changes accordingly. However, it is not necessary to recompute the entire kernel matrix. To update the solution through equation (18), only the entries K_γ(x_i, x_j) for i ∈ E and j ∈ E ∪ L need to be computed again. The cost of calculating b_γ is O(mn) and that of inverting an (m+1) × (m+1) matrix is O(m³). To monitor whether a point leaves its current set to join another set, the entries K_γ(x_i, x_j), i ∈ 1, ..., n, j ∈ E ∪ L, are also required, giving a total cost of O(n² + m³) for each iteration, where m is typically much smaller than n. The total number of events is the number of times that the points pass through E. Its value depends on both the range [γ_min, γ_0] and the λ value specified by the user. From many experiments, empirical findings suggest that the number of iterations is always some small multiple c of n for a large enough range of γ, which implies that the total computational cost is O(cn³ + cnm³). In practice, users usually try out a small number of hyperparameter values on an SVM solver to gain some basic understanding of the data. This preliminary investigation can provide some hints on how to specify the region in which the users are interested for the kernel path algorithm.

Rosset (2004) proposed a general path following algorithm based on second-order approximation when the solution path is not piecewise linear. It assumes that the solution update is not exact. Thus the algorithm only takes a very small step s in each iteration and applies a single Newton-Raphson step to approximate the next solution. Since it does not try to calculate the breakpoint value, the step, i.e., the difference |γ − γ^l| between consecutive γ values, has to be very small for the approximation to remain accurate, so the total number of iterations will become very large. On the contrary, in our algorithm, if the number of iterations increases, the error tolerance will decrease exponentially, making it possible to keep the error tolerance arbitrarily small.
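In contrast to such fixed small steps, the inner loop of Table 1 (steps 3–11) adapts the step through the decay rate. A hedged sketch of that loop, assuming the `update_elbow_solution` helper above, a kernel-matrix builder `kernel(X, X, gamma)`, and a hypothetical `is_valid` predicate that checks conditions (9)–(11), e.g. over the marginal set M only:

```python
import numpy as np

def next_breakpoint(gamma_t, X, y, beta, beta0, lam, E, theta, eps, kernel, is_valid):
    """Approximate the next breakpoint gamma^{l+1} (Table 1, steps 3-11)."""
    r = theta
    while r < 1.0 - eps:
        gamma = r * gamma_t                                            # trial gamma (step 5)
        K_new = kernel(X, X, gamma)
        b, b0 = update_elbow_solution(K_new, y, beta, beta0, lam, E)   # step 6, eq. (18)
        if is_valid(b, b0, gamma):                                     # no event has occurred yet (step 7)
            beta, beta0, gamma_t = b, b0, gamma                        # accept the move (step 8)
        else:
            r = np.sqrt(r)                                             # overshot an event: r <- r^(1/2) (step 9)
    return gamma_t, beta, beta0
```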


Figure 2. SVM classification results of the two-moons dataset for different kernel hyperparameter values (γ = 5, 2, 0.5). In each sub-figure, the middle (green) line shows the decision boundary and the other two lines specify the margins, with the points on them indicated by blue circles. λ is set to 1.


Figure 3. Change in (a) the generalization accuracy, (b) the number of SVs and (c) the number of events along the kernel path with different λ values for the two-moons dataset. The horizontal axis is in logarithmic scale. The generalization accuracy first increases and then decreases, and the number of SVs first decreases and then increases, as γ decreases from 10 to 0.02.

5. Experimental Results

The behavior of the above algorithms can best be illustrated using video. We have prepared some illustrative examples as videos at http://www.cse.ust.hk/~wanggang/sol_path/SvmKernelPath.htm.

In our experiments, we consider two binary classification datasets which possess very different properties. The first is the two-moons dataset with strong manifold structure. It is generated according to a pattern of two intertwining moons, and the data points in the two classes are well separated from each other. The second dataset is generated from two Gaussian distributions with different means and covariances corresponding to the positive and negative classes, which partially overlap with each other. Each of the two datasets contains 100 positive points and 100 negative points. We randomly select 50% of the points for training and the rest for estimating the generalization error. The Gaussian RBF kernel is used in the experiments. To explore the solutions along the kernel path, we first use LIBSVM (Chang & Lin, 2001) to compute an initial optimal solution and then execute the kernel path algorithm as described above.
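For the initialization step, any standard SVM solver can be used. The sketch below uses scikit-learn's SVC in place of LIBSVM, under the reparameterization C = 1/λ, β_i = λα_i, β_0 = λb and RBF parameter gamma = 1/γ for the kernel exp(−‖x − x′‖²/γ); this mapping is an assumption derived from (1)–(7), not something stated in the paper:

```python
import numpy as np
from sklearn.svm import SVC

def initial_solution(X, y, lam, gamma0):
    """Fit a standard C-SVM and map its dual solution to (beta, beta_0) of eqs. (4)-(7)."""
    clf = SVC(C=1.0 / lam, kernel="rbf", gamma=1.0 / gamma0).fit(X, y)
    beta = np.zeros(len(y))
    # dual_coef_ stores y_i * alpha_i for the support vectors; beta_i = lam * alpha_i lies in [0, 1]
    beta[clf.support_] = lam * np.abs(clf.dual_coef_.ravel())
    beta0 = lam * float(clf.intercept_[0])       # so that f(x) = (1/lam)(sum_i beta_i y_i K + beta_0)
    return beta, beta0
```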


Figure 4. SVM classification results of the two-Gaussians dataset for different kernel hyperparameter values (γ =5, 2, 0.5). λ is set to 1.


Figure 5. Change in (a) the generalization accuracy, (b) the number of SVs and (c) the number of events along the kernel path with different λ values for the two-Gaussians dataset. The horizontal axis is in logarithmic scale. The generalization accuracy decreases and the number of SVs increases as γ decreases from 10 to 0.2.
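Curves such as those in Figures 3 and 5 can be produced by logging a few statistics at every accepted γ on the path. A small illustrative sketch with hypothetical names (`path` is a list of (γ, β, β_0) triples, `gamma_kernel` builds the cross-kernel matrix between test and training points):

```python
import numpy as np

def path_statistics(path, X_test, y_test, X, y, lam, gamma_kernel):
    """Record test accuracy and number of SVs at each point of the kernel path."""
    stats = []
    for gamma, beta, beta0 in path:
        K = gamma_kernel(X_test, X, gamma)
        f = (K @ (beta * y) + beta0) / lam            # decision values, eq. (7)
        acc = float((np.sign(f) == y_test).mean())    # estimate of the generalization accuracy
        n_sv = int((beta > 1e-12).sum())              # support vectors: beta_i > 0
        stats.append((gamma, acc, n_sv))
    return stats
```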

For the two-moons dataset, the kernel path starts from γ = 10 and extends in the decreasing direction of γ until γ = 0.02, while setting λ = 1. Figure 2 shows the classification results at three points of the kernel path, γ = 5, 2, 0.5. As we can see, different kernel hyperparameter values produce dramatically different decision boundaries. When γ is large, the decision boundary is closer to linear, which leads to large generalization error on this dataset. As γ decreases, the decision boundary becomes more flexible to fit into the gap between the two classes. Thus, the generalization error decreases accordingly. When γ = 0.5, the decision function separates the points of the two classes well and only a small number of SVs are used to represent the function. Figure 3 shows the change in the generalization accuracy, the number of SVs, and the number of events along the kernel path with different λ values. We notice that the generalization accuracy increases and the number of SVs decreases as γ decreases from 10 to 0.5. Since there exists a well-separated boundary between the two classes, setting a smaller λ value will give fewer SVs and attain better performance. Although the generalization accuracies at the beginning of the kernel paths have significant differences, they all can reach 100% accuracy as γ decreases. However, as γ further decreases to be very small, the margins become so elastic that most of the input points are included as SVs. Hence, the decision function has high variance, which leads to the overfitting problem. We also notice from Figure 3(c) that the computational cost of the kernel path can be optimized by setting a proper regularization value.

Figure 4 shows the classification results of the two-Gaussians dataset for different γ values. Unlike the two-moons dataset, this dataset has an overlapping region between the two classes. Hence, the optimal decision boundary is expected to be a smooth surface that passes through the overlapping region. This can be obtained by setting γ to a relatively large value. Since setting a larger λ value on this dataset leads to wider margins and a smoother decision boundary, higher generalization accuracy can be obtained and more points become SVs. As the kernel path extends in the decreasing direction, the decision function becomes more rugged and a higher degree of overfitting will occur. The decision boundary changes only slightly as the kernel path starts from γ = 10, and both the number of SVs and the generalization accuracy remain stable. The decision function shows more significant changes as the kernel path algorithm proceeds after γ becomes less than 2. As γ decreases, more and more points become SVs and the generalization accuracy drops dramatically. Figures 5(a) and (b) illustrate this. As a smaller λ value tends to minimize the training error, more severe overfitting will occur along the kernel path for λ = 0.1 than for λ = 5 on this dataset. When γ becomes small, most points become SVs and the decision function is no longer a very sparse representation. Its generalization accuracy thus decreases.

6. Conclusion and Future Work

In kernel machines, it is very important to set the kernel hyperparameters to appropriate values in order to obtain a classifier that generalizes well. In this paper, we derive an exact updating formula to calculate solutions along the kernel path in SVC. Since the piecewise linearity property does not hold along the kernel path, breakpoints cannot be calculated in advance. We propose an efficient algorithm to find the breakpoints so that the kernel path can be traced without having to train the model multiple times.

It is also important to set a proper regularization value in SVC, since it can not only produce a decision function that generalizes well but also reduce the number of iterations, leading to a speedup of the kernel path algorithm. We are interested in integrating this kernel path algorithm with a regularization path algorithm into a unified approach for tracing solution paths in a two-dimensional hyperparameter space. This direction will be explored in our future work.

Acknowledgment

This research is supported by Competitive Earmarked Research Grant (CERG) 621706 from the Research Grants Council (RGC) of the Hong Kong Special Administrative Region, China. The authors would also like to thank the anonymous reviewers for their constructive comments.

References

Bach, F., Lanckriet, G., & Jordan, M. (2004). Multiple kernel learning, conic duality, and the SMO algorithm. Proceedings of the 21st International Conference on Machine Learning (ICML-04).

Bach, F., Thibaux, R., & Jordan, M. (2005). Regularization paths for learning multiple kernels. Advances in Neural Information Processing Systems 17 (NIPS-05).

Chang, C., & Lin, C. (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

Cristianini, N., Kandola, J., Elisseeff, A., & Shawe-Taylor, J. (2002). On kernel target alignment. Advances in Neural Information Processing Systems 15 (NIPS-02).

Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32, 407–499.

Hastie, T., Rosset, S., Tibshirani, R., & Zhu, J. (2004). The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5, 1391–1415.

Keerthi, S., Sindhwani, V., & Chapelle, O. (2006). An efficient method for gradient-based adaptation of hyperparameters in SVM models. Advances in Neural Information Processing Systems 19 (NIPS-06).

Lanckriet, G., Cristianini, N., Bartlett, P., El Ghaoui, L., & Jordan, M. (2004). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5, 27–72.

Müller, K., Mika, S., Rätsch, G., Tsuda, K., & Schölkopf, B. (2001). An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12, 181–202.

Ong, C., Smola, A., & Williamson, R. (2005). Learning the kernel with hyperkernels. Journal of Machine Learning Research, 6, 1043–1071.

Rosset, S. (2004). Following curved regularized optimization solution paths. Advances in Neural Information Processing Systems 17 (NIPS-04).

Rosset, S., & Zhu, J. (2003). Piecewise linear regularized solution paths (Technical Report). Stanford University.

Schölkopf, B., & Smola, A. (2002). Learning with kernels. MIT Press.

Sonnenburg, S., Rätsch, G., Schäfer, C., & Schölkopf, B. (2006). Large scale multiple kernel learning. Journal of Machine Learning Research, 7, 1531–1565.

Wang, G., Yeung, D., & Lochovsky, F. (2006). Two-dimensional solution path for support vector regression. Proceedings of the 23rd International Conference on Machine Learning (ICML-06).

Zhang, Z., Kwok, J., & Yeung, D. (2006). Model-based transductive learning of the kernel matrix. Machine Learning, 63, 69–101.

Zhu, J., Rosset, S., Hastie, T., & Tibshirani, R. (2003). 1-norm support vector machines. Advances in Neural Information Processing Systems 16 (NIPS-03).