LETTER Communicated by Terrence Sejnowski

An SMO Algorithm for the Potential Support Vector Machine

Tilman Knebel [email protected] Neural Information Processing Group, Fakultät IV, Technische Universität Berlin, 10587 Berlin, Germany

Sepp Hochreiter [email protected] Institute of Bioinformatics, Johannes Kepler University Linz, 4040 Linz, Austria

Klaus Obermayer [email protected] Neural Information Processing Group, Fakultät IV and Bernstein Center for Computational Neuroscience, Technische Universität Berlin, 10587 Berlin, Germany

We describe a fast sequential minimal optimization (SMO) procedure for solving the dual optimization problem of the recently proposed potential support vector machine (P-SVM). The new SMO consists of a sequence of iteration steps in which the Lagrangian is optimized with respect to either one (single SMO) or two (dual SMO) of the Lagrange multipliers while keeping the other variables fixed. An efficient selection procedure for Lagrange multipliers is given, and two heuristics for improving the SMO procedure are described: block optimization and annealing of the regularization parameter ε. A comparison of the variants shows that the dual SMO, including block optimization and annealing, performs efficiently in terms of computation time. In contrast to standard support vector machines (SVMs), the P-SVM is applicable to arbitrary dyadic data sets, but benchmarks are provided against libSVM's ε-SVR and C-SVC implementations for problems that are also solvable by standard SVM methods. For those problems, the computation time of the P-SVM is comparable to or somewhat higher than that of the standard SVM. The number of support vectors found by the P-SVM is usually much smaller for the same generalization performance.

1 Introduction

Learning from examples in order to predict is one of the standard tasks in machine learning. Many techniques have been developed to solve classification and regression problems, but by far most of them were specifically designed for vectorial data. However, for many data sets, a vector-based description is inconvenient or may even be wrong, and other representations like dyadic data (Hofmann & Puzicha, 1998; Hoff, 2005), which are based on relationships between objects (see Table 1a), are more appropriate.


Table 1: (a) Cartoon of a Dyadic Data Set. (b) Pairwise Data: Special Case of Dyadic Data, Where Row and Column Objects Coincide.

(a)
        a    b    c    d    e    f
  α     0    2    0   −1   −7   −8
  β     1    7    2    0    0   −2
  χ     8   −9   −1    0    1    2

(b)
        a     b     c     d     e
  a    1.0  −0.1   0.2   0.9  −0.5
  b   −0.1   1.0   0.2   0.1   0.3
  c    0.2   0.2   1.0  −0.2  −0.2
  d    0.9   0.1  −0.2   1.0   0.5
  e   −0.5   0.3  −0.2   0.5   1.0

Note: Column objects {a, b, ...}, whose attributes should be predicted, are quantitatively described by a matrix of numerical values representing their mutual relationships with the descriptive row objects {α, β, ...}.

Support vector machines (SVMs; Schölkopf & Smola, 2002; Vapnik, 1998) are a successful class of algorithms for solving supervised learning tasks. Although they were originally developed for vectorial data, the actual predictor and the learning procedure make use of a relational representation. Given a proper similarity measure between objects, SVMs learn to predict attributes based on a pairwise "similarity" matrix, which is usually called the Gram matrix (see Table 1b).

Standard SVMs, however, underlie some technical as well as conceptual restrictions (for a detailed discussion, see Hochreiter & Obermayer, 2006a). First, SVMs operate on pairwise data and cannot be extended in a straightforward way to general dyadic representations (see Table 1). The similarity measure (kernel) has to be positive definite,¹ and the general case of dyadic data, where the relationships between two different sets of objects are quantified, cannot be handled directly.² True dyadic data, however, occur in several application areas, including bioinformatics (genes versus probes in DNA microarrays, molecules versus transferable atom equivalent descriptors for predicting certain properties of proteins) or information retrieval (word frequencies versus documents, mutual hyperlinks between Web pages). Hence, learning algorithms that are able to operate on these data are of high value.

Second, standard SVM solutions also face a couple of technical disadvantages. The solution of an SVM learning problem, for example, is scale sensitive, because the final predictor depends on how the training data were scaled (see, e.g., Figures 2 and 3 of Hochreiter & Obermayer, 2006a).

¹At least the final Gram matrix of the training examples should be.
²Some workarounds exist. The relational data set can, for example, be treated like a vectorial representation, as in the feature map method (Schölkopf & Smola, 2002).

This problem of scale sensitivity is avoided by the potential SVM (P-SVM) approach through a modified cost function, which softly enforces a whitening of the data in feature space. A second disadvantage of standard SVMs relates to the fact that all margin errors translate into support vectors. The number of support vectors can therefore be larger than necessary, for example, if many training data points are from regions in data space where classes of a classification problem overlap. Finally, the P-SVM method leads to an expansion of the predictor into a sparse set of the descriptive "row" objects (see Table 1b) rather than training data points as in standard SVM approaches. It can therefore be used as a wrapper method for feature selection applications, and previous results on quite challenging data sets have been promising (cf. Hochreiter & Obermayer, 2004, 2006b).

The practical application of SVMs, however, requires an efficient implementation of the dual optimization problem (Hochreiter & Obermayer, 2006a). In the following we describe an implementation based on the idea of sequential minimal optimization (SMO; Platt, 1999; Keerthi & Shevade, 2003; Keerthi, Shevade, Bhattacharyya, & Murthy, 2001; Lai, Mani, & Palaniswami, 2003), and we compare several variants of the P-SVM SMO with respect to computation time for a number of standard benchmark data sets. The results demonstrate that an efficient implementation of the P-SVM exists, which makes its application to general, real-world dyadic data sets feasible. (This SMO implementation of the P-SVM is available online at http://ni.cs.tu-berlin.de/software/psvm.) The optimal SMO version is compared with libSVM's ε-SVR (support vector regression) and C-SVC (support vector classification) for learning problems that are also solvable by standard SVM approaches. For those problems, the computation time of the P-SVM is comparable to or somewhat higher than that of the standard SVM. The number of support vectors found by the P-SVM is usually much smaller than the number of support vectors found by the corresponding ε-SVR and C-SVC for the same generalization performance, which gives the P-SVM method an advantage, especially for large data sets.

2 The P-SVM Optimization Problem

Let $X_\phi$ be the matrix of the L data vectors in a feature space $\phi$, w the normal vector of a separating hyperplane, y the vector of the L attributes (binary in case of classification, real valued in case of regression) of the data vectors, and K the L × P kernel matrix of scalar products between the L vectors in feature space and the P complex features provided by the measurement process. Then the P-SVM primal optimization problem has the form (see Hochreiter & Obermayer, 2006a):

$$\min_{w,\,\xi^+,\,\xi^-}\ \frac{1}{2}\,\bigl\|X_\phi^\top w\bigr\|^2 + C\,\mathbf{1}^\top(\xi^+ + \xi^-) \qquad (2.1)$$

$$\begin{aligned}
\text{s.t.}\quad & K^\top\!\left(X_\phi^\top w - y\right) + \xi^+ + \epsilon\mathbf{1} \ge 0\\
& K^\top\!\left(X_\phi^\top w - y\right) - \xi^- - \epsilon\mathbf{1} \le 0\\
& 0 \le \xi^+,\ \xi^-,
\end{aligned}$$

where K is normalized column-wise to zero mean and variance $L^{-1}$. The parameters C and ε correspond to the two different regularization schemes that have been suggested for the P-SVM method, where ε-regularization has proven more useful for feature selection and C-regularization for classification or regression problems (Hochreiter & Obermayer, 2006a). ξ⁺ and ξ⁻ are the vectors of slack variables describing violations of the constraints. In order to derive the dual version, we consider the Lagrangian

$$\begin{aligned}
L ={}& \tfrac{1}{2}\, w^\top X_\phi X_\phi^\top w + C\,\mathbf{1}^\top(\xi^+ + \xi^-)\\
&- (\alpha^+)^\top\!\left[K^\top\!\left(X_\phi^\top w - y\right) + \xi^+ + \epsilon\mathbf{1}\right]\\
&+ (\alpha^-)^\top\!\left[K^\top\!\left(X_\phi^\top w - y\right) - \xi^- - \epsilon\mathbf{1}\right]\\
&- (\mu^+)^\top \xi^+ - (\mu^-)^\top \xi^-, \qquad (2.2)
\end{aligned}$$

where the α⁺ and α⁻ are the Lagrange multipliers for the residual error constraints and the μ⁺ and μ⁻ are the Lagrange multipliers for the slack variable constraints (ξ⁺, ξ⁻ ≥ 0). Setting the derivative of L with respect to w equal to zero leads to

$$X_\phi X_\phi^\top w = X_\phi K\,(\alpha^+ - \alpha^-), \qquad (2.3)$$

which is always fulfilled if

$$X_\phi^\top w = K\,(\alpha^+ - \alpha^-). \qquad (2.4)$$

If $X_\phi X_\phi^\top$ does not have full rank, all solutions w of equation 2.3 lie within a subspace that depends on α⁺ and α⁻. Setting the derivatives of L with respect to ξ⁺, ξ⁻ equal to zero leads to

$$C\mathbf{1} - \alpha^+ - \mu^+ = 0 \quad\text{and} \qquad (2.5)$$
$$C\mathbf{1} - \alpha^- - \mu^- = 0. \qquad (2.6)$$

Inserting equation 2.4 into equation 2.2 leads to the dual optimization problem:

$$\min_{\alpha^+,\,\alpha^-}\ \tfrac{1}{2}\,(\alpha^+ - \alpha^-)^\top K^\top K\,(\alpha^+ - \alpha^-) - y^\top K\,(\alpha^+ - \alpha^-) + \epsilon\,\mathbf{1}^\top(\alpha^+ + \alpha^-) \qquad (2.7)$$

$$\text{s.t.}\quad 0 \le \alpha^+,\ \alpha^- \le C\mathbf{1}.$$
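As a concrete illustration of the ingredients of this dual problem, the following minimal sketch (NumPy assumed; the function name normalize_columns is ours, not part of the P-SVM software) shows the column-wise normalization of K to zero mean and variance $L^{-1}$ required above, which makes every diagonal element of $Q = K^\top K$ equal to one, a fact used again in section 3.

```python
import numpy as np

def normalize_columns(K):
    """Normalize each column of the L x P matrix K to zero mean and
    variance 1/L, i.e., unit Euclidean column norm, so that the
    diagonal of Q = K^T K equals one."""
    K = K - K.mean(axis=0, keepdims=True)            # zero column mean
    norms = np.linalg.norm(K, axis=0, keepdims=True)
    norms[norms == 0.0] = 1.0                        # guard against constant columns
    return K / norms

# Usage sketch with random dyadic data (L row objects, P column features)
rng = np.random.default_rng(0)
K = normalize_columns(rng.normal(size=(200, 50)))
Q = K.T @ K                                          # P x P, np.allclose(np.diag(Q), 1.0)
```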

The final P-SVM predictor is given by

$$y_{\text{pred}} = f(K\alpha + b), \qquad (2.8)$$

where f(h) = sign(h) for classification and f(h) = h for regression. The threshold variable b is determined by the mean of the label vector y (cf. equation 18 of Hochreiter & Obermayer, 2006a). The Karush-Kuhn-Tucker (KKT) conditions state that for the optimal solution, the product between the dual variables and the left-hand side of the primal constraints in equation 2.1 is zero. Using

$$Q := K^\top K, \qquad F := Q\,(\alpha^+ - \alpha^-) - K^\top y, \qquad (2.9)$$

we obtain

$$F_j + \epsilon + \xi_j^+ \ge 0 \qquad (2.10)$$
$$F_j - \epsilon - \xi_j^- \le 0 \qquad (2.11)$$

for the primal constraints in equation 2.1 and

$$\alpha_j^+\,\bigl(F_j + \epsilon + \xi_j^+\bigr) = 0 \qquad (2.12)$$
$$\alpha_j^-\,\bigl(F_j - \epsilon - \xi_j^-\bigr) = 0 \qquad (2.13)$$
$$\mu_j^+\,\xi_j^+ = 0 \qquad (2.14)$$
$$\mu_j^-\,\xi_j^- = 0 \qquad (2.15)$$

for the KKT conditions. Following the procedure described in Keerthi et al. (2001), we obtain

$$F_j + \epsilon < 0 \ \overset{(2.10)}{\Longrightarrow}\ \xi_j^+ > 0 \ \overset{(2.14)}{\Longrightarrow}\ \mu_j^+ = 0 \ \overset{(2.5)}{\Longleftrightarrow}\ \alpha_j^+ = C$$
$$F_j - \epsilon > 0 \ \overset{(2.11)}{\Longrightarrow}\ \xi_j^- > 0 \ \overset{(2.15)}{\Longrightarrow}\ \mu_j^- = 0 \ \overset{(2.6)}{\Longleftrightarrow}\ \alpha_j^- = C$$

$$F_j + \epsilon > 0 \ \overset{(2.12,\,2.1)}{\Longrightarrow}\ \alpha_j^+ = 0$$
$$F_j - \epsilon < 0 \ \overset{(2.13,\,2.1)}{\Longrightarrow}\ \alpha_j^- = 0,$$

where the equation numbers above the arrows denote the equations that have been used. Hence the KKT conditions are met if

$$\begin{aligned}
\alpha_j^+ = 0 &\ \Longrightarrow\ F_j + \epsilon \ge 0\\
0 < \alpha_j^+ < C &\ \Longrightarrow\ F_j + \epsilon = 0 \qquad (2.16)\\
\alpha_j^+ = C &\ \Longrightarrow\ F_j + \epsilon \le 0
\end{aligned}$$

and

$$\begin{aligned}
\alpha_j^- = 0 &\ \Longrightarrow\ F_j - \epsilon \le 0\\
0 < \alpha_j^- < C &\ \Longrightarrow\ F_j - \epsilon = 0 \qquad (2.17)\\
\alpha_j^- = C &\ \Longrightarrow\ F_j - \epsilon \ge 0.
\end{aligned}$$

Because $\alpha_j^+$ and $\alpha_j^-$ are never both larger than zero,³

$$\alpha_j^+ > 0 \ \overset{(2.12,\,2.1)}{\Longrightarrow}\ F_j + \epsilon \le 0 \ \Longrightarrow\ F_j - \epsilon < 0 \ \overset{(2.13,\,2.1)}{\Longrightarrow}\ \alpha_j^- = 0$$
$$\alpha_j^- > 0 \ \overset{(2.13,\,2.1)}{\Longrightarrow}\ F_j - \epsilon \ge 0 \ \Longrightarrow\ F_j + \epsilon > 0 \ \overset{(2.12,\,2.1)}{\Longrightarrow}\ \alpha_j^+ = 0,$$

we can use

$$\alpha = \alpha^+ - \alpha^- \qquad (2.18)$$

in order to rewrite equation 2.7 in compact form:

$$\min_{\alpha}\ \tfrac{1}{2}\,\alpha^\top Q\,\alpha - y^\top K\,\alpha + \epsilon\,\|\alpha\|_1 \qquad (2.19)$$

$$\text{s.t.}\quad -C \le \alpha_j \le C.$$

The term $\epsilon\,\|\alpha\|_1 = \epsilon\,\mathbf{1}^\top(\alpha^+ + \alpha^-)$ enforces an expansion of w that is sparse in the terms with nonzero values $\alpha_j$. This occurs because for large enough values of ε, this term forces all Lagrange multipliers $\alpha_j$ toward zero except for the components that are most relevant for classification or regression. If $K^\top K$ does not have full rank, finite values of ε lift the degeneracy of the solutions.
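The compact dual, equation 2.19, is straightforward to evaluate numerically. The following minimal sketch (NumPy assumed; the function names are ours) computes the objective and checks the box constraint for a candidate vector α; it is meant only to make the notation concrete, not as part of the optimizer itself.

```python
import numpy as np

def dual_objective(alpha, Q, K, y, eps):
    """Objective of equation 2.19: 1/2 a^T Q a - y^T K a + eps * ||a||_1."""
    return 0.5 * alpha @ Q @ alpha - y @ (K @ alpha) + eps * np.abs(alpha).sum()

def in_box(alpha, C):
    """Box constraint of equation 2.19: -C <= alpha_j <= C for all j."""
    return bool(np.all(np.abs(alpha) <= C))
```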

³ $F_j + \epsilon = F_j - \epsilon = 0$ is impossible if $\epsilon > 0$.

Since the primal, equation 2.1, and dual, equation 2.7, optimization problems are different from those of standard SVM methods (there is, for example, no equality constraint), a new SMO method is necessary. We describe this new SMO method in the following section.

3 The SMO Procedure for the P-SVM Method

3.1 The Optimization Step. The new SMO procedure involves a sequence of optimization steps in which the Lagrangian, equation 2.19, is optimized with respect to two Lagrange multipliers $\alpha_i$ and $\alpha_k$ while keeping the remaining parameters fixed. We start with the description of the optimization step and then describe the block optimization and annealing heuristics. Because the dual optimization problem of the P-SVM lacks an equality constraint, it is also possible to derive a single SMO procedure, where only one Lagrange parameter is optimized at every step. This single SMO is described in appendix A. The benchmark results of section 4, however, show that best results are usually obtained using the dual SMO procedure; hence, the dual SMO is to be preferred. Using

$$\alpha_j^+ = \begin{cases} \alpha_j^{\text{old}} + \Delta\alpha_j & \text{if } \alpha_j^{\text{old}} + \Delta\alpha_j \ge 0\\ 0 & \text{if } \alpha_j^{\text{old}} + \Delta\alpha_j < 0 \end{cases} \qquad (3.1)$$
$$\alpha_j^- = \begin{cases} 0 & \text{if } \alpha_j^{\text{old}} + \Delta\alpha_j \ge 0\\ -\alpha_j^{\text{old}} - \Delta\alpha_j & \text{if } \alpha_j^{\text{old}} + \Delta\alpha_j < 0 \end{cases}$$

and setting the derivatives of equation 2.7 with respect to $\Delta\alpha_i$ and $\Delta\alpha_k$ to zero, we obtain

$$\begin{pmatrix} Q_{ii} & Q_{ik}\\ Q_{ki} & Q_{kk} \end{pmatrix}\begin{pmatrix} \Delta\alpha_i\\ \Delta\alpha_k \end{pmatrix} + \begin{pmatrix} F_i\\ F_k \end{pmatrix} + \epsilon\begin{pmatrix} s_i\\ s_k \end{pmatrix} \overset{!}{=} 0, \qquad (3.2)$$

where

$$s_{i,k} = \begin{cases} 1 & \text{if } \alpha_{i,k}^{\text{old}} + \Delta\alpha_{i,k} \ge 0\\ -1 & \text{if } \alpha_{i,k}^{\text{old}} + \Delta\alpha_{i,k} < 0. \end{cases}$$

Note that $Q_{ik} = Q_{ki}$ and K is normalized (cf. Hochreiter & Obermayer, 2006a) so that the diagonal elements of Q are equal to 1 ($Q_{ii} = Q_{kk} = 1$). The unconstrained minimum of equation 2.19 with respect to $\alpha_i$ and $\alpha_k$ is then given by

$$\alpha_i^{\text{new}} = \alpha_i^{\text{old}} + \Delta\alpha_i$$
$$\alpha_k^{\text{new}} = \alpha_k^{\text{old}} + \Delta\alpha_k,$$

where

$$\Delta\alpha_i = \frac{(F_k + \epsilon s_k)\,Q_{ik} - (F_i + \epsilon s_i)}{1 - Q_{ik}^2} \qquad (3.3)$$

$$\Delta\alpha_k = \frac{(F_i + \epsilon s_i)\,Q_{ik} - (F_k + \epsilon s_k)}{1 - Q_{ik}^2}.$$

The evaluation of equations 3.3 requires an estimation of $s_{i,k}$. For all $j \in \{i, k\}$ we assume $\alpha_j^{\text{new}} < 0$ if $\alpha_j^{\text{old}} < 0$, which implies $s_j = -1$. If $\alpha_j^{\text{old}} > 0$, we assume $\alpha_j^{\text{new}} > 0$, which implies $s_j = 1$. If $\alpha_j^{\text{old}} = 0$, we consider both $s_j = 1$ and $s_j = -1$ and use the solution with the larger $|\Delta\alpha_j|$. If $\alpha_i^{\text{old}} = \alpha_k^{\text{old}} = 0$, we use the solution for which one of the values $|\Delta\alpha_i|$ or $|\Delta\alpha_k|$ is larger. The unconstrained minimum $\alpha_i^{\text{new}}, \alpha_k^{\text{new}}$ now has to be checked against the box constraints $L_i \le \alpha_i^{\text{new}} \le U_i$ and $L_k \le \alpha_k^{\text{new}} \le U_k$,

$$L_i = \begin{cases} -C & \text{if } s_i = -1\\ 0 & \text{if } s_i = 1 \end{cases} \qquad L_k = \begin{cases} -C & \text{if } s_k = -1\\ 0 & \text{if } s_k = 1 \end{cases} \qquad (3.4)$$
$$U_i = \begin{cases} 0 & \text{if } s_i = -1\\ +C & \text{if } s_i = 1 \end{cases} \qquad U_k = \begin{cases} 0 & \text{if } s_k = -1\\ +C & \text{if } s_k = 1, \end{cases} \qquad (3.5)$$

and the values of the Lagrange multipliers have to be properly corrected if the unconstrained minimum is located outside this range. The location of the unconstrained optimum can appear in six nontrivially different configurations relative to the position of the box (see Figures 1A–1F). For Figure 1A, the minimum is given by $\alpha_i^{\text{new}}, \alpha_k^{\text{new}}$. For Figures 1B and 1C, the minimum can be determined by (1) setting $\alpha_i$ (Figure 1B) or $\alpha_k$ (Figure 1C) to the bound closer to the unconstrained minimum, (2) calculating $F_k$ (Figure 1B) or $F_i$ (Figure 1C) for the new values, and (3) correcting $\alpha_k$, $\Delta\alpha_k = -F_k - \epsilon s_k$ (Figure 1B), or $\alpha_i$, $\Delta\alpha_i = -F_i - \epsilon s_i$ (Figure 1C) (see equation A.2). If the new values for $\alpha_k$ (Figure 1B) or $\alpha_i$ (Figure 1C) violate the constraints, they again have to be set to the nearest bound. In order to identify case D (see Figure 1D), $\Delta\alpha_k$ has to be evaluated after steps 1 and 2. For Figure 1D, the new $\alpha_k$ fulfills the previously violated box constraint. If it violates the other constraint, it has to be set to the corresponding bound. Case E (see Figure 1E) is identified accordingly. For the remaining case (Figure 1F), the optimal value is given by setting both $\alpha_i$ and $\alpha_k$ to their nearest corner with respect to the unconstrained minimum.

After initialization (α = 0), the SMO iterations start, and for every iteration step, two indices i, k must be chosen. Our choice is based on a modified heuristic described in Platt (1999).
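A minimal sketch of one such pair update follows, assuming NumPy; the function name pair_update is ours. It computes the unconstrained minimum via equation 3.3 and then simply clips to the box of equations 3.4 and 3.5, so the full case analysis of Figures 1A–1F is deliberately omitted, and for $\alpha_j^{\text{old}} = 0$ the sign is fixed to +1 instead of trying both signs.

```python
import numpy as np

def pair_update(i, k, alpha, F, Q, eps, C):
    """One simplified dual-SMO step for indices (i, k): unconstrained minimum
    of equation 2.19 via equation 3.3, then naive clipping to the box [L, U].
    The paper's full correction distinguishes the cases of Figures 1A-1F."""
    # sign estimates s_i, s_k from the current alpha (zero treated as +1 here)
    s = np.where(alpha[[i, k]] >= 0.0, 1.0, -1.0)
    q = Q[i, k]
    denom = 1.0 - q * q
    if denom < 1e-12:                      # nearly parallel rows: skip this pair
        return alpha, F
    d_i = ((F[k] + eps * s[1]) * q - (F[i] + eps * s[0])) / denom
    d_k = ((F[i] + eps * s[0]) * q - (F[k] + eps * s[1])) / denom
    # box constraints (3.4, 3.5): [-C, 0] for s = -1, [0, +C] for s = +1
    lo = np.where(s < 0.0, -C, 0.0)
    hi = np.where(s < 0.0, 0.0, +C)
    new_i = float(np.clip(alpha[i] + d_i, lo[0], hi[0]))
    new_k = float(np.clip(alpha[k] + d_k, lo[1], hi[1]))
    # incremental update of F = Q alpha - K^T y (only entries i and k change)
    F = F + Q[:, i] * (new_i - alpha[i]) + Q[:, k] * (new_k - alpha[k])
    alpha = alpha.copy()
    alpha[i], alpha[k] = new_i, new_k
    return alpha, F
```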

Figure 1: Illustration of the six nontrivial configurations (A–F) of the unconstrained minimum (shaded ellipsoids with contour lines) of the objective function, equation 2.19, relative to the box constraints (square box). The ×'s indicate the location of the minimum after correction. The shaded areas outside the boxes indicate all possible locations of the unconstrained minimum that correspond to a particular configuration.

The Karush-Kuhn-Tucker (KKT) conditions are violated for all $\alpha_j$ for which the corresponding values $d_j$,

$$d_j = \begin{cases} \max(-F_j - \epsilon,\ F_j - \epsilon) & \text{if } \alpha_j = 0\\ |F_j + \epsilon| & \text{if } 0 < \alpha_j < C\\ F_j + \epsilon & \text{if } \alpha_j = C\\ |{-F_j} + \epsilon| & \text{if } -C < \alpha_j < 0\\ -F_j + \epsilon & \text{if } \alpha_j = -C, \end{cases} \qquad (3.6)$$

are larger than zero. From those $\alpha_j$ we choose $\alpha_i$ such that

$$i = \arg\max_j d_j, \qquad j = 1,\ldots,P,$$

and we terminate the SMO if di is below a predefined threshold λ. After αi has been chosen, we compute the location of the unconstrained minimum, evaluate the box constraints for all pairs (i, k), k ∈ [1,...,i − 1, i + 1,...,P], and then choose k such that

$$k = \arg\max_j\ \bigl\{\max(|\Delta\alpha_i|, |\Delta\alpha_j|)\bigr\}, \qquad j = 1,\ldots,i-1,\,i+1,\ldots,P. \qquad (3.7)$$
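A sketch of the working-set selection, again assuming NumPy and with hypothetical function names. Only the choice of the first index i (equation 3.6 plus the termination test) is spelled out; the second index k would then be found by computing the candidate steps Δα for all pairs (i, j) as above and maximizing max(|Δα_i|, |Δα_j|) according to equation 3.7.

```python
import numpy as np

def kkt_violations(alpha, F, eps, C, tol=1e-12):
    """Violation measure d_j of equation 3.6 for every Lagrange multiplier."""
    d = np.zeros_like(alpha)
    at_zero = np.abs(alpha) <= tol
    pos_in  = (alpha >  tol) & (alpha <  C - tol)
    neg_in  = (alpha < -tol) & (alpha > -C + tol)
    at_posC = alpha >=  C - tol
    at_negC = alpha <= -C + tol
    d[at_zero] = np.maximum(-F[at_zero] - eps, F[at_zero] - eps)
    d[pos_in]  = np.abs(F[pos_in] + eps)
    d[at_posC] = F[at_posC] + eps
    d[neg_in]  = np.abs(-F[neg_in] + eps)
    d[at_negC] = -F[at_negC] + eps
    return d

def select_first_index(alpha, F, eps, C, lam):
    """Return the worst KKT violator i, or None if even the largest
    violation d_i is below the termination threshold lambda."""
    d = kkt_violations(alpha, F, eps, C)
    i = int(np.argmax(d))
    return None if d[i] < lam else i
```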

3.2 Block Optimization. During optimization, it often happens that long sequences of SMO iterations occur involving the same Lagrange parameters. For example, if three Lagrange parameters are far from optimal, the SMO procedure may perform pairwise updates at every iteration step, during which the value of the third parameter increases its distance from its optimal value. To avoid such oscillations, we recommend a block update, where equation 2.19 is solved simultaneously for more than two Lagrange multipliers. The algorithm, which is based on a sequence of Cholesky decompositions and constraint satisfactions, is given in appendix B. A block update is initiated if one of the following conditions is fulfilled: (1) a number $b_c$ of different Lagrange parameters hit either +C or −C since the previous block update, (2) a number $b_n$ of different Lagrange parameters were changed since the previous block update, or (3) the number of different Lagrange parameters that were updated since the previous block update reaches $b_f$ times the number of SMO iterations since the previous block update. The first and second conditions limit the computational complexity and number of block updates, while the latter condition avoids extended oscillations of the Lagrange parameters. A convenient choice is $b_c = 4$, $b_n = 21$, and $b_f = 3$. The block update is then performed using all Lagrange parameters that are nonzero and were changed at least once since the previous block update.
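A minimal bookkeeping sketch of this trigger heuristic in plain Python; all names are ours. Note that condition (3) is implemented here in the spirit of the stated goal of detecting oscillations, that is, the block update fires when the number of SMO iterations since the last block update reaches $b_f$ times the number of distinct parameters touched; this reading is our assumption, not a statement of the authors.

```python
class BlockUpdateTrigger:
    """Bookkeeping for the block-update heuristic of section 3.2 (a sketch)."""

    def __init__(self, b_c=4, b_n=21, b_f=3):
        self.b_c, self.b_n, self.b_f = b_c, b_n, b_f
        self.reset()

    def reset(self):
        """Call after every block update."""
        self.hit_bound = set()      # indices that reached +C or -C
        self.changed = set()        # indices changed at least once
        self.iterations = 0

    def record(self, j, new_value, C, tol=1e-12):
        """Call once per updated Lagrange parameter within an SMO iteration."""
        self.changed.add(j)
        if abs(abs(new_value) - C) < tol:
            self.hit_bound.add(j)

    def step(self):
        """Call once per SMO iteration, after recording its updates."""
        self.iterations += 1

    def should_fire(self):
        return (len(self.hit_bound) >= self.b_c                # condition (1)
                or len(self.changed) >= self.b_n               # condition (2)
                or (len(self.changed) > 0 and                  # condition (3), our reading:
                    self.iterations >= self.b_f * len(self.changed)))
```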

3.3 ε-Annealing. Hochreiter and Obermayer (2006a) showed that the number of support vectors depends on the regularization parameter ε. If ε is large, many constraints are fulfilled, and the number of support vectors is low. The smaller ε gets, the larger the number of support vectors becomes. Since SMO algorithms are fast for problems with a small number of support vectors, we suggest solving the dual optimization problem using an annealing strategy for the parameter ε. Since all KKT conditions (see equations 2.16 and 2.17) are met for $\epsilon \ge \|K^\top y\|_\infty$, annealing starts with a lower value of ε, for example, $\epsilon = 0.1 \cdot \|K^\top y\|_\infty$. ε is then decreased between SMO iterations according to an exponential schedule (e.g., $\epsilon_{\text{new}} = 0.9 \cdot \epsilon_{\text{old}}$). Since it is not important to find the optimum of equation 2.19 for intermediate values of ε, the termination parameter $\lambda_{\text{ann}}$ for the SMO procedure is set to $\lambda_{\text{ann}} = 4\lambda$ until the selected final value of ε is reached.
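A sketch of the ε-annealing schedule, assuming NumPy; the generator name and its arguments are ours. Each yielded intermediate value would be used for a round of SMO iterations with the loose tolerance $\lambda_{\text{ann}} = 4\lambda$, and the final value with the strict tolerance λ.

```python
import numpy as np

def epsilon_schedule(K, y, eps_final, start_fraction=0.1, decay=0.9):
    """Exponential epsilon-annealing (section 3.3): start at a fraction of
    ||K^T y||_inf, above which all KKT conditions would already be met,
    and decay geometrically toward the selected final epsilon."""
    eps = start_fraction * np.max(np.abs(K.T @ y))
    while eps > eps_final:
        yield eps            # optimize with termination threshold 4 * lambda
        eps *= decay
    yield eps_final          # final epsilon: optimize to threshold lambda
```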

4 Experiments

In this section we compare our P-SVM implementation with the ε-SVR (regression, hyperparameters ε and C) and C-SVC (classification, hyperparameter C) from the libSVM implementation (Chang & Lin, 2001) with respect to the computational costs of training and testing, the number of support vectors, and the estimated generalization error of the final predictor. Computing time (in seconds) is measured on a 2 GHz workstation. Training time refers to the number of seconds of CPU time needed until a convergence criterion was reached (P-SVM: λ = 0.05; C-SVC, ε-SVR: $\delta_{\text{term}}$ = 0.001). The P-SVM is applied using the dual SMO, block optimization, and ε-annealing. Testing time refers to the number of seconds spent predicting the attributes of a test data set of a given size. Benchmark data sets were chosen from the UCI repository (Newman, Hettich, Blake, & Merz, 1998) and from http://ni.cs.tu-berlin.de/software/psvm, and both hyperparameters ε and C were optimized by grid search for each data set.

Table 2 shows the benchmark results. libSVM is faster during training but needs many more support vectors than the P-SVM to obtain its best result. The cross-validation mean squared error (MSE) of the P-SVM is on average below that of libSVM. The P-SVM learning procedure is computationally more expensive than the libSVM implementation of standard SVMs, but the number of support vectors, that is, the computing time during test, is much smaller.

Table 3 shows benchmark results for the different SMO variants of the P-SVM implementation, including single versus dual SMO, block optimization, and ε-annealing. Best results are obtained for the dual SMO implementation using both block optimization and ε-annealing. In numerical simulations, one often observes that support vectors for high ε values remain support vectors for lower values. Since the number of Lagrange multipliers that are changed at least once during the SMO procedure is always greater than or equal to the final number of nonzero elements, ε-annealing significantly reduces the number of Lagrange multipliers that are changed at least once, but needs a somewhat larger number of SMO iterations. If the reduction in CPU time due to the smaller number of rows of Q that need to be calculated (O(P · L) per row of Q) is larger than the increase in time due to the larger number of SMO iterations, ε-annealing should be used. This is usually the case if block optimization and the dual SMO are applied. Then ε-annealing provides an additional computational advantage (see Table 3).

Table 4 shows how computation time scales with the number of training examples for the data set Adult from the UCI repository if an RBF kernel is applied. Training time is approximately a factor of two larger for the P-SVM, but on the same order of magnitude for the same generalization performance (79%, five-fold cross-validation).

Table 2: Comparison Between the P-SVM and the libSVM Implementations of ε-SVR and C-SVC with Respect to Number of Support Vectors, CPU Time for Training (Seconds), CPU Time for Testing (Seconds), Mean Squared Generalization Error (MSE) for ε-SVR, and Percentage of Correctly Classified Data Points for C-SVC, Determined Using n-Fold Cross-Validation.

Data Set: Abalone (UCI); RBF kernel, γ = 1; 4177 examples; 8 features; regression
  P-SVM:  C = 5000, ε = 0.003; number of SVs: 71; training time = 45 sec; testing time = 0.89 sec; MSE (20-fold) = 4.417
  libSVM: C = 110, ε = 1.8; number of SVs: 1126; training time = 5.2 sec; testing time = 18 sec; MSE (20-fold) = 4.419

Data Set: Space (UCI, rescaled); RBF kernel, γ = 1; 3107 examples; 6 features; regression
  P-SVM:  C = 1000, ε = 0.0006; number of SVs: 86; training time = 60 sec; testing time = 1.0 sec; MSE (10-fold) = 0.011
  libSVM: C = 100, ε = 0.01; number of SVs: 2786; training time = 34 sec; testing time = 36 sec; MSE (10-fold) = 0.010

Data Set: Cluster2x (psvm); RBF kernel, γ = 10⁻⁴; 1000 examples; 15 features; regression
  P-SVM:  C = 100, ε = 0.0016; number of SVs: 139; training time = 4.5 sec; testing time = 2.0 sec; MSE (20-fold) = 0.1050
  libSVM: C = 70, ε = 0.19; number of SVs: 564; training time = 1.3 sec; testing time = 13 sec; MSE (20-fold) = 0.1187

Data Set: Cluster2a (psvm); RBF kernel, γ = 10⁻⁴; 200 examples; 10 features; regression
  P-SVM:  C = 28, ε = 0.0013; number of SVs: 149; training time = 0.26 sec; testing time = 2.0 sec; MSE (30-fold) = 1.17
  libSVM: C = 1000, ε = 0.6; number of SVs: 130; training time = 0.32 sec; testing time = 4.5 sec; MSE (30-fold) = 3.09

Data Set: Heart (UCI, rescaled); RBF kernel, γ = 0.05; 270 examples; 13 features; regression
  P-SVM:  C = 1000, ε = 0.44; number of SVs: 9; training time = 0.031 sec; testing time = 0.12 sec; MSE (50-fold) = 0.483
  libSVM: C = 0.6, ε = 0.41; number of SVs: 131; training time = 0.11 sec; testing time = 5.5 sec; MSE (50-fold) = 0.495

Data Set: Arcene (NIPS2003); linear kernel; 100 examples; 10,000 features; classification
  P-SVM:  C = 100, ε = 0.002; number of SVs: 83; training time = 0.30 sec; testing time = 0.33 sec; accuracy (30-fold) = 86%
  libSVM: C = 100; number of SVs: 79; training time = 0.59 sec; testing time = 1.6 sec; accuracy (30-fold) = 87%

Notes: CPU time for testing is obtained for a data set of 10⁵ (Arcene: 100) examples. γ denotes the width of the selected RBF kernel. Each feature of the data sets Heart and Space was linearly scaled to the range [−1, +1]. The SMO parameters for the P-SVM are λ = 0.05, $\lambda_{\text{ann}} = 4\lambda$, $b_c = 4$, $b_n = 21$, $b_f = 3$. For libSVM we used the default configuration $\delta_{\text{term}}$ = 0.001, shrinking activated, and kernel caching with 40 MB.

The SMO implementation of the P-SVM is therefore competitive with standard SVM methods for standard SVM problems. The P-SVM, however, is the only method so far that can handle indefinite kernel functions and arbitrary dyadic data. This feature can lead to an enormous speed-up of the P-SVM compared to standard SVM methods, as

Table 3: Benchmark Results for SMO Variants of the P-SVM Implementation.

dbl  blk  ann    Abalone        Space          Cluster2x     Arcene
 +    +    +       45 (255)       60 (427)      4.5 (366)    0.30 (194)
 +    +    −       90 (1147)      82 (1351)     4.9 (878)    0.12 (195)
 +    −    +     1618 (260)      663 (373)     67   (347)    1.26 (192)
 +    −    −     1517 (982)      623 (1282)    16   (815)    0.51 (194)
 −    +    +      101 (201)       60 (343)      8.8 (363)    1.68 (188)
 −    +    −       85 (329)       56 (580)      6.9 (459)    0.75 (191)
 −    −    +     5850 (156)     1214 (240)    184   (351)   27    (188)
 −    −    −     4848 (182)     1030 (300)    171   (370)   13    (187)

Notes: The columns dbl (1), blk (2), and ann (3) indicate whether the dual SMO instead of the single SMO (1), block optimization (2), or ε-annealing (3) was applied. Numbers denote the CPU time (in seconds) needed for solving the dual problem of the P-SVM using hyperparameters that minimize the cross-validation error. The total number of evaluated rows of Q is given in parentheses and equals the number of different Lagrange parameters changed at least once during the SMO procedure.

Table 4: Comparison Between the P-SVM and C-SVC of the libSVM Implementation for the Data Set Adult with Respect to the CPU Time τ (in Seconds) as a Function of Training Set Size.

Number of Examples    P-SVM ($\tau_n$)    P-SVM ($n_Q$)    C-SVC ($\tau_n$)

1000 0.6 76 1 2000 3.1 142 3 3000 8.0 166 8 4000 17 202 9 5000 29 226 14 6000 45 243 20 7000 67 270 35 8000 96 292 42 9000 132 329 55 10000 166 330 63

Notes: $n_Q$ denotes the total number of evaluated rows of Q. The parameters are RBF kernel γ = 10⁻⁸; P-SVM: ε = 0.4, C = 1; C-SVC: C = 10.

shown in Table 5 for the data set Adult. Table 5 shows benchmark results of the P-SVM in dyadic data mode and of the C-SVC of libSVM for a linear kernel. While the generalization performance (measured through five-fold cross-validation) is 84% for the P-SVM and converges to 79% for the C-SVC, the CPU times differ by a factor of up to 8554. CPU time is roughly proportional to the number of training objects for both the P-SVM and the C-SVC, but training time is much less for the P-SVM.

Table 5: Comparison Between the P-SVM and the C-SVC of the libSVM Implementation for the Data Set Adult with Respect to CPU Time τ (in Seconds) as a Function of Training Set Size.

Number of Examples    P-SVM ($\tau_n$)    C-SVC ($\tau_n$)

10,000 0.30 3195 20,000 0.58 6780 30,000 0.97 9810 Downloaded from http://direct.mit.edu/neco/article-pdf/20/1/271/817177/neco.2008.20.1.271.pdf by guest on 02 October 2021 40,000 1.48 12,660

Notes: The P-SVM operates in dyadic mode directly on the data set, while a linear kernel is used for the C-SVC. The parameters are ε = 1, C = 1000 for the P-SVM, and C = 1 for the C-SVC.

Appendix A: The Single SMO Procedure

Due to the missing equality constraint in the dual problem of the P-SVM, an SMO procedure can be derived where only one Lagrange multiplier is optimized in every iteration. Setting the derivative of equation 2.7 with respect to $\Delta\alpha_i$ to zero and using equations 3.1, we obtain

$$Q_{ii}\cdot\Delta\alpha_i + F_i + \epsilon s_i \overset{!}{=} 0, \qquad (A.1)$$

where

$$s_i = \begin{cases} 1 & \text{if } \alpha_i^{\text{old}} + \Delta\alpha_i \ge 0\\ -1 & \text{if } \alpha_i^{\text{old}} + \Delta\alpha_i < 0. \end{cases}$$

Note that K is normalized (cf. Hochreiter & Obermayer, 2006a) so that the diagonal elements of Q are equal to 1. The unconstrained minimum of equation 2.19 with respect to $\alpha_i$ is then given by

$$\alpha_i^{\text{new}} = \alpha_i^{\text{old}} + \Delta\alpha_i,$$

where

$$\Delta\alpha_i = -F_i - \epsilon s_i. \qquad (A.2)$$

The evaluation of equation A.2 requires an estimation of $s_i$. We assume $\alpha_i^{\text{new}} < 0$ if $\alpha_i^{\text{old}} < 0$, which implies $s_i = -1$. If $\alpha_i^{\text{old}} > 0$, we assume $\alpha_i^{\text{new}} > 0$, which implies $s_i = 1$. If $\alpha_i^{\text{old}} = 0$, we consider both $s_i = 1$ and $s_i = -1$ and use the solution with the larger $|\Delta\alpha_i|$. The application of the box constraints is similar to the dual case described in section 3.
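A minimal sketch of one single-SMO step (equations A.1 and A.2), assuming NumPy and that $F = Q\alpha - K^\top y$ is kept up to date; the function name single_update is ours.

```python
import numpy as np

def single_update(i, alpha, F, Q, eps, C):
    """One single-SMO step (appendix A): Delta alpha_i = -F_i - eps * s_i,
    trying both signs when alpha_i = 0 and keeping the larger |Delta alpha_i|,
    then clipping to the box of equations 3.4 and 3.5."""
    if alpha[i] > 0:
        signs = (1.0,)
    elif alpha[i] < 0:
        signs = (-1.0,)
    else:
        signs = (1.0, -1.0)
    candidates = []
    for s in signs:
        delta = -F[i] - eps * s
        lo, hi = (0.0, C) if s > 0 else (-C, 0.0)
        new_i = float(np.clip(alpha[i] + delta, lo, hi))
        candidates.append((abs(delta), new_i))
    _, new_i = max(candidates)              # larger |Delta alpha_i| wins
    F = F + Q[:, i] * (new_i - alpha[i])    # keep F = Q alpha - K^T y up to date
    alpha = alpha.copy()
    alpha[i] = new_i
    return alpha, F
```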

Appendix B: A Recursive Algorithm for the Block Update

The block update procedure for solving equation 2.19 can be described using the following pseudocode:

function α_new = optimize(α_old)
{
    α_new := solvecholesky(α_old)
    α_new := bound(α_new)
    α_new_bound := ListOfBoundedAlphas(α_new)
    if |α_new_bound| > 4 : return α_old
    if |α_new_bound| > 0 : α_new := reoptimize(α_new, α_new_bound, α_new_bound.front)
    return α_new
}

function α_new = reoptimize(α_old, α_old_bound, b)
{
    α_new := optimize(α_old \ b) | apply(b)
    for all i ∈ α_old_bound \ b
        i_new := find element from α_new which belongs to i
        if (i_new == i)
            α_new := reoptimize(α_old, α_old_bound \ b, i)
            return α_new
        endif
    return α_new
}

The central function is optimize, which recursively uses optimize and reoptimize. α_old and α_new are sets of Lagrange multipliers that are implemented as linked lists of variable length. α_old_bound and α_new_bound are subsets of the corresponding α_old and α_new. b and i are single elements of α_old_bound or α_new_bound. The function solvecholesky solves the unconstrained problem using the Cholesky decomposition, and the function bound sets all Lagrange parameters that violate a constraint to their nearest bounds. ListOfBoundedAlphas returns a linked list of Lagrange multipliers whose values are on a bound. The call to optimize is followed by "| apply(b)", which updates F (see equation 2.9) using the Lagrange parameter b. After using the modified F during the corresponding optimization procedure, F is restored to its old values. The operator α \ b removes Lagrange parameter b from the linked list of Lagrange parameters α. The algorithm terminates after finding an exact solution or if more than four Lagrange parameters reach the bound.
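The unconstrained solve inside the block update can be sketched as follows (NumPy/SciPy assumed; the function name solve_block is ours). It solves the block analogue of equation 3.2, $Q_{BB}\,\Delta\alpha_B = -(F_B + \epsilon s_B)$, by a Cholesky factorization and then applies the bound step; the recursive reoptimization of multipliers that end up on a bound, as in the pseudocode above, is omitted.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def solve_block(B, alpha, F, Q, eps, C):
    """Unconstrained block solve of the block-update step (a sketch).
    Assumes Q[B, B] is positive definite; the signs s are taken from the
    current alpha, with zeros treated as positive."""
    B = np.asarray(B, dtype=int)
    s = np.where(alpha[B] >= 0.0, 1.0, -1.0)
    rhs = -(F[B] + eps * s)
    delta = cho_solve(cho_factor(Q[np.ix_(B, B)]), rhs)   # Q_BB * delta = rhs
    new_vals = alpha[B] + delta
    lo = np.where(s < 0.0, -C, 0.0)                       # box of equations 3.4, 3.5
    hi = np.where(s < 0.0, 0.0, +C)
    return np.clip(new_vals, lo, hi)                      # the "bound" step
```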

Acknowledgments

We thank A. Paus, C. Sannelli, and J. Mohr for testing the SMO. This work was funded by the BMBF (grant no. 01GQ0414) and BMWA (grant no. 01MD501).

References

Chang, C.-C., & Lin, C.-J. (2001). LIBSVM: A library for support vector machines. Software available online at http://www.csie.ntu.edu.tw/∼cjlin/libsvm.
Hochreiter, S., & Obermayer, K. (2004). Gene selection for microarray data. In B. Schölkopf, K. Tsuda, & J.-P. Vert (Eds.), Kernel methods in computational biology (pp. 319–355). Cambridge, MA: MIT Press.
Hochreiter, S., & Obermayer, K. (2006a). Support vector machines for dyadic data. Neural Computation, 18(6), 1472–1510.
Hochreiter, S., & Obermayer, K. (2006b). Nonlinear feature selection with the potential support vector machine. In I. Guyon, S. Gunn, M. Nikravesh, & L. Zadeh (Eds.), Feature extraction: Foundations and applications (pp. 419–438). New York: Springer.
Hoff, P. D. (2005). Bilinear mixed-effects models for dyadic data. Journal of the American Statistical Association, 100(469), 286–295.
Hofmann, T., & Puzicha, J. (1998). Unsupervised learning from dyadic data (Tech. Rep. TR-98-042). Berkeley, CA: International Computer Science Institute.
Keerthi, S. S., & Shevade, S. K. (2003). SMO algorithm for least-squares SVM formulations. Neural Computation, 15(2), 487–507.
Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (2001). Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation, 13(3), 637–649.
Lai, D., Mani, N., & Palaniswami, M. (2003). A new method to select working sets for faster training for support vector machines (Tech. Rep. MESCE-30-2003). Clayton, Victoria, Australia: Department of Electrical and Computer Systems Engineering, Monash University.
Newman, D. J., Hettich, S., Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. Irvine: University of California, Department of Information and Computer Science. Available online at http://www.ics.uci.edu/∼mlearn/MLRepository.html.

Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels—Support vector machines, regularization, optimization, and beyond. Cambridge, MA: MIT Press.
Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.

Received June 6, 2006; accepted February 13, 2007.