Support Vector Machines for Dyadic Data
Sepp Hochreiter and Klaus Obermayer
Department of Electrical Engineering and Computer Science, Technische Universität Berlin, 10587 Berlin, Germany
{hochreit,oby}@cs.tu-berlin.de
Abstract

We describe a new technique for the analysis of dyadic data, where two sets of objects ("row" and "column" objects) are characterized by a matrix of numerical values which describe their mutual relationships. The new technique, called "Potential Support Vector Machine" (P-SVM), is a large-margin method for the construction of classifiers and regression functions for the "column" objects. Contrary to standard support vector machine approaches, the P-SVM minimizes a scale-invariant capacity measure and requires a new set of constraints. As a result, the P-SVM method leads to a usually sparse expansion of the classification and regression functions in terms of the "row" rather than the "column" objects and can handle data and kernel matrices which are neither positive definite nor square. We then describe two complementary regularization schemes. The first scheme improves generalization performance for classification and regression tasks; the second scheme leads to the selection of a small, informative set of "row" "support" objects and can be applied to feature selection. Benchmarks for classification, regression, and feature selection tasks are performed with toy data as well as with several real world data sets. The results show that the new method is at least competitive with, and often performs better than, the benchmarked standard methods for standard vectorial as well as for true dyadic data sets. In addition, a theoretical justification is provided for the new approach.
1 Introduction
Learning from examples in order to predict is one of the standard tasks in machine learning. Many techniques have been developed to solve classification and regression problems, but by far most of them were specifically designed for vectorial data. Vectorial data, where data objects are described by vectors of numbers which are treated as elements of a vector space, are very convenient because of the structure imposed by the Euclidean metric. However, there are
datasets for which a vector-based description is inconvenient or simply wrong, and representations which consider relationships between objects are more appropriate.
A)        a      b      c      d
α       1.3   −2.2   −1.6    7.8
β      −1.8   −1.1    7.2    2.3
χ       1.2    1.9   −2.9   −2.2
δ       3.7    0.8   −0.6    2.5
ε       9.2   −9.4   −8.3    9.2
φ      −7.7    8.6   −9.7   −7.4
γ      −4.8    0.1   −1.2    0.9

B)        a      b      c      d      e      f      g
a       0.9   −0.1   −0.8    0.5    0.2   −0.5   −0.7
b      −0.1    0.9    0.6    0.3   −0.7   −0.6    0.3
c      −0.8    0.6    0.9    0.2   −0.6    0.6    0.5
d       0.5    0.3    0.2    0.9    0.7    0.1    0.3
e       0.2   −0.7   −0.6    0.7    0.9   −0.9   −0.5
f      −0.5   −0.6    0.6    0.1   −0.9    0.9    0.9
g      −0.7    0.3    0.5    0.3   −0.5    0.9    0.9

Figure 1: A) Dyadic data: Column objects {a, b, ...} and row objects {α, β, ...} are represented by a matrix of numerical values which describe their mutual relationships. B) Pairwise data: a special case of dyadic data, where row and column objects coincide.
In the following we will study representations of data objects which are based on matrices. We consider "column" objects, which are the objects to be described, and "row" objects, which are the objects which serve for their description (cf. Fig. 1A). The whole dataset can then be represented using a rectangular matrix whose entries denote the relationships between the corresponding "row" and "column" objects. We will call representations of this form dyadic data (Hofmann and Puzicha, 1998; Li and Loken, 2002; Hoff, 2005). If "row" and "column" objects are from the same set (Fig. 1B), the representation is usually called pairwise data, and the entries of the matrix can often be interpreted as the degree of similarity (or dissimilarity) between objects. Dyadic descriptions are more powerful than vector-based descriptions, as vectorial data can always be brought into dyadic form when required. This is often done for kernel-based classifiers or regression functions (Schölkopf and Smola, 2002; Vapnik, 1998), where a Gram matrix of mutual similarities is calculated before the predictor is learned. A similar procedure can also be used in cases where the "row" and "column" objects are from different sets. If both of them are described by feature vectors, a matrix can be calculated by applying a kernel function to pairs of feature vectors, one vector describing a "row" and the other vector describing a "column" object. One example for this is the drug-gene matrix of Scherf et al. (2000), which was constructed as the product of a measured drug-sample and a measured sample-gene matrix and where the kernel function was a scalar product.
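The construction just described (applying a kernel function to a pair of "row" and "column" feature vectors) can be sketched in a few lines of Python. The RBF kernel and the tiny feature vectors below are hypothetical and serve only to illustrate how a rectangular dyadic matrix arises:

```python
import numpy as np

def dyadic_matrix(col_objects, row_objects, kernel):
    """Relation matrix K with K[i, j] = kernel(column object i, row object j).

    Rows of K are indexed by the "column" objects (the objects to be
    described), columns of K by the "row" objects (the descriptors)."""
    return np.array([[kernel(x, z) for z in row_objects] for x in col_objects])

# Hypothetical feature vectors for 3 "column" and 2 "row" objects.
cols = [np.array([0.0, 1.0]), np.array([1.0, 0.0]), np.array([1.0, 1.0])]
rows = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]

rbf = lambda x, z: np.exp(-0.5 * np.sum((x - z) ** 2))
K = dyadic_matrix(cols, rows, rbf)
print(K.shape)  # (3, 2): rectangular, need not be square or positive definite
```

The resulting matrix has as many rows as there are objects to describe and as many columns as there are descriptors, which is exactly the shape of the data matrices considered in this paper.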
In many cases, however, dyadic descriptions emerge because the matrix entries are measured directly. Pairwise data representations as a special case of dyadic data can be found for datasets where similarities or distances between objects are measured. Examples include sequential or biophysical similarities between proteins (Lipman and Pearson, 1985; Sigrist et al., 2002; Falquet et al., 2002), chromosome location or co-expression similarities between genes (Cremer et al., 1993; Lu et al., 1994; Heyer et al., 1999), co-citations of text documents (White and McCain, 1989; Bayer et al., 1990; Ahlgren et al., 2003), or hyperlinks between web-pages (Kleinberg, 1999). In general, these measured matrices are symmetric but may not be positive definite, and even if they are for the training set, they may not remain positive definite if new examples are included. Genuine dyadic data occur whenever two sets of objects are related. Examples are DNA microarray data (Southern, 1988; Lysov et al., 1988; Drmanac et al., 1989; Bains and Smith, 1988), where the "column" objects are tissue samples, the "row" objects are genes, and every sample-gene pair is related by the expression level of this particular gene in this particular sample. Other examples are web-documents, where both the "column" and "row" objects are web-pages and "column" web-pages are described by either hyperlinks to or from the "row" objects, which give rise to a rectangular matrix¹. Further examples include documents in a database described by word-frequencies (Salton, 1968) or molecules described by transferable atom equivalent (TAE) descriptors (Mazza et al., 2001).
Traditionally, "row" objects have often been called "features" and "column" vectors of the dyadic data matrix have mostly been treated as "feature vectors" which live in a Euclidean vector space, even when the dyadic nature of the data was made explicit, see (Graepel et al., 1999; Mangasarian, 1998) or the "feature map" method (Schölkopf and Smola, 2002). Difficulties, however, arise when features are heterogeneous, and "apples and oranges" must be compared. What theoretical arguments would, for example, justify treating the values of a set of TAE descriptors as coordinates of a vector in a Euclidean space? A "non-vectorial" approach to pairwise data is to interpret the data matrix as a Gram matrix and to apply support vector machines (SVM) for classification and regression if the data matrix is positive semidefinite (Graepel et al., 1999). For indefinite (but symmetric) matrices two other non-vectorial approaches have been suggested (Graepel et al., 1999). In the first approach, the data matrix is projected into the subspace spanned by the eigenvectors with positive eigenvalues. This is an approximation, and one would expect good results only if the absolute values of the negative eigenvalues are small compared to the dominant positive ones. In the second approach, directions of negative eigenvalues are processed by flipping the sign of these eigenvalues. All three approaches lead to positive semidefinite matrices on the training set, but positive semidefiniteness is not assured if a new test object must be included. An embedding approach was suggested by Herbrich et al. (1998) for antisymmetric matrices, but this was
¹ Note that for the pairwise data examples the linking matrix was symmetric, because links were considered bidirectional.
specifically designed for data sets where the matrix entries denote preference relations between objects. In summary, no general and principled method exists to learn classifiers or regression functions for dyadic data. In order to avoid the abovementioned shortcomings, we suggest to consider "column" and "row" objects on an equal footing and to interpret the matrix entries as the result of a kernel function or measurement kernel, which takes a "row" object, applies it to a "column" object, and outputs a number. It will turn out that mild conditions on this kernel suffice to create a vector space endowed with a dot product into which the "row" and the "column" objects can be mapped. Using this mathematical argument as a justification, we show how to construct classification and regression functions in this vector space in analogy to the large margin based methods for learning perceptrons for vectorial data. Using an improved measure for model complexity and a new set of constraints which ensure a good performance on the training data, we arrive at a generally applicable method to learn predictors for dyadic data. The new method is called the "potential support vector machine" (P-SVM). It can handle rectangular matrices as well as pairwise data whose matrices are not necessarily positive semidefinite, but even when the P-SVM is applied to regular Gram matrices, it shows very good results when compared with standard methods. Due to the choice of constraints, the final predictor is expanded into a usually sparse set of descriptive "row" objects, which is different from the standard SVM expansion in terms of the "column" objects. This opens up another important application domain: a sparse expansion is equivalent to feature selection (Guyon and Elisseeff, 2003; Hochreiter and Obermayer, 2004b; Kohavi and John, 1997; Blum and Langley, 1997).
An efficient implementation of the P-SVM using a modified sequential minimal optimization procedure for learning is described in (Hochreiter and Obermayer, 2004a).
2 The Potential Support Vector Machine

2.1 A Scale-Invariant Objective Function

Consider a set {x_φ^i | 1 ≤ i ≤ L} of objects which are described by feature vectors x_φ^i ∈ R^N and which form a training set X_φ = (x_φ^1, ..., x_φ^L). The index "φ" is introduced because we will later use the kernel trick and assume that the vectors x_φ^i are images of a map φ which is induced either by a kernel or by a measurement function. Assume for the moment a simple binary classification problem, where class membership is indicated by labels y_i ∈ {+1, −1}, ⟨·, ·⟩ denotes the dot product, and a linear classifier parameterized by the weight vector w and the offset b is chosen from the set
    { sign(f(x_φ)) | f(x_φ) = ⟨w, x_φ⟩ + b } .   (1)
Standard SVM techniques select the classifier with the largest margin under the constraint of correct classification on the training set:

    min_{w,b}  (1/2) ||w||²                          (2)
    s.t.  y_i (⟨w, x_φ^i⟩ + b) ≥ 1 .

If the training data are not linearly separable, a large margin is traded against a small training error using a suitable regularization scheme. The maximum margin objective is motivated by bounds on the generalization error using the Vapnik-Chervonenkis (VC) dimension h as capacity measure (Vapnik, 1998). For the set of all linear classifiers defined on X_φ for which ρ ≥ ρ_min holds, one obtains h ≤ min{[R²/ρ_min²], N} + 1 (see Vapnik, 1998; Schölkopf and Smola, 2002). [·] denotes the integer part, ρ is the margin, and R is the radius of the smallest sphere in data space which contains all the training data. Bounds derived using the fat shattering dimension (Shawe-Taylor et al., 1996, 1998; Schölkopf and Smola, 2002) and bounds on the expected generalization error (cf. Vapnik, 1998; Schölkopf and Smola, 2002) depend on R/ρ_min in a similar manner. Both the selection of a classifier using the maximum margin principle and the values obtained for the generalization error bounds suffer from the problem that
they are not invariant under linear transformations. This problem is illustrated
Figure 2: LEFT: data points from two classes (triangles and circles) are separated by the hyperplane with the largest margin (solid line). The two support vectors (black symbols) are separated by d_x along the horizontal and by d_y along the vertical axis, 4ρ² = d_x² + d_y² and R²/ρ² = 4R²/(d_x² + d_y²). The dashed line indicates the classification boundary of the classifier shown on the right, scaled along the vertical axis by the factor 1/s. RIGHT: the same data, but scaled along the vertical axis by the factor s. The solid line denotes the maximum margin hyperplane, 4ρ² = d_x² + s²d_y² and R²/ρ² = 4R²/(d_x² + s²d_y²). For d_y ≠ 0 both the margin ρ and the term R²/ρ² depend on s.
in Fig. 2 for a 2D classification problem. In the left figure, both classes are separated by the hyperplane with the largest margin (solid line). In the right figure, the same classification problem is shown, but scaled along the vertical axis by a factor s. Again, the solid line denotes the support vector solution, but when the classifier is scaled back to s = 1 (dashed line in the left figure) it no longer coincides with the original SVM solution. The ratio R²/ρ², which bounds the VC dimension, also depends on the scale factor s (see legend of Fig. 2). The situation depicted in Fig. 2 occurs whenever the data can be enclosed by a (non-spherical) ellipsoid (cf. Schölkopf et al., 1999). The considerations of Fig. 2 show that the optimal hyperplane is not scale invariant and that predictions of class labels may change if the data is rescaled before learning. Thus the question arises which scale factors should be used for classifier selection. Here we suggest to scale the training data such that the margin remains constant while the radius R of the sphere containing all training data becomes as small as possible. The result is a new sphere with radius R̃ which leads to a tighter margin-based bound for the generalization error. Optimality is achieved when all directions orthogonal to the normal vector w of the hyperplane with maximal margin are scaled to zero and

    R̃ = min_t max_i |⟨ŵ, x_φ^i⟩ + t| ≤ max_i |⟨ŵ, x_φ^i⟩| ,   where ŵ := w/||w|| .

If |t| is small compared to max_i |⟨ŵ, x_φ^i⟩|, e.g. if the data is centered at the origin, t can be neglected through the above inequality. Unfortunately, the above formulation does not lead to an optimization problem which is easy to implement. Therefore, we suggest to minimize the upper bound

    R̃²/ρ² = R̃² ||w||² ≤ max_i ⟨w, x_φ^i⟩² ≤ Σ_i ⟨w, x_φ^i⟩² = ||X_φ^⊤ w||² ,   (3)

where the matrix X_φ := (x_φ^1, x_φ^2, ..., x_φ^L) contains the training vectors x_φ^i.
In the second inequality the maximum norm is bounded by the Euclidean norm. Its worst case factor is L, but the bound is tight. The new objective is well defined also for cases where X_φ^⊤ X_φ and/or X_φ X_φ^⊤ is singular, and the kernel trick carries over to the new technique. It can be shown that replacing the objective function ||w||² (eqs. (2)) by the upper bound
    w^⊤ X_φ X_φ^⊤ w = ||X_φ^⊤ w||²   (4)
on R̃²/ρ², eq. (3), corresponds to the integration of sphering (whitening) and SVM learning if the data have zero mean. If the data have already been sphered, then the covariance matrix is given by X_φ X_φ^⊤ = I and we recover the classical SVM. If not, minimizing the new objective leads to normal vectors which are rotated towards directions of low variance of the data when compared with the standard maximum margin solution. Note, however, that whitening can easily be performed in input space but becomes nontrivial if the data is mapped to a high-dimensional feature space
using a kernel function. In order to test whether the situation depicted in Fig. 2 actually occurs in practice, we applied a C-SVM with an RBF kernel to the UCI data set "breast-cancer" from Subsection 3.1.1. The kernel width σ was chosen small and C large enough, so that no error occurred on the training set (to be consistent with Fig. 2). Sphering was performed in feature space by replacing the objective function in eq. (2) with the new objective, eq. (4). Fig. 3 shows the angle
    ψ = arccos( ⟨w_svm, w_sph⟩ / (√⟨w_svm, w_svm⟩ √⟨w_sph, w_sph⟩) )   (5)
      = arccos( Σ_{i,j} α_i^svm α_j^sph y_i y_j k(x^i, x^j)
                / (√(Σ_{i,j} α_i^svm α_j^svm y_i y_j k(x^i, x^j)) √(Σ_{i,j} α_i^sph α_j^sph y_i y_j k(x^i, x^j))) )

between the hyperplanes for the sphered ("sph") and the non-sphered ("svm") case as a function of the width σ of the RBF kernel. The observed values range between 6° and 50°, providing evidence for the situation shown in Fig. 2.

Figure 3: Application of a C-SVM to the data set "breast-cancer" of the UCI repository. The angle ψ (eq. (5)) between the weight vectors w_svm and w_sph is plotted as a function of σ (C = 1000). For σ → 0 the data is already approximately sphered in feature space, hence the objective functions from eqs. (2) and (4) lead to similar results.
The new objective function, eq. (4), leads to separating hyperplanes which are invariant under linear transformations of the data. As a consequence, neither the bounds nor the performance of the derived classifier depends on how the training data was scaled. But is the new objective function also related to a capacity measure for the model class, as the margin is? It is: in (Hochreiter and Obermayer, 2004c) it has been shown that the capacity measure, eq. (4), emerges when a bound for the generalization error is constructed using the technique of covering numbers.
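The claimed equivalence between the new objective and sphering can be checked numerically. The following sketch, on randomly generated anisotropic toy data, verifies that w^⊤ X_φ X_φ^⊤ w on the raw data equals the classical SVM objective ||v||² on sphered data under the substitution v = C^{1/2} w, where C = X_φ X_φ^⊤:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 50))   # columns are training vectors x^i
X[1] *= 5.0                    # anisotropic: large variance along the second axis

C = X @ X.T                    # (unnormalized) covariance, assumed invertible here
eigval, eigvec = np.linalg.eigh(C)
C_inv_sqrt = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
C_sqrt = eigvec @ np.diag(eigval ** 0.5) @ eigvec.T

X_sph = C_inv_sqrt @ X         # sphered data: X_sph @ X_sph.T == identity

# The new objective on raw data equals the classical ||v||^2 on sphered data
w = rng.normal(size=2)         # an arbitrary weight vector
v = C_sqrt @ w                 # substitution w = C^{-1/2} v
new_objective = w @ C @ w      # ||X^T w||^2 on the raw data
classical = v @ v              # ||v||^2 for the sphered problem
print(np.isclose(new_objective, classical))  # True
```

This is only a numerical illustration of the algebraic identity; it says nothing about the constraints, which must be transformed consistently as well.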
2.2 Constraints for Complex Features

The next step is to formulate a set of constraints which enforce a good performance on the training set and which implement the new idea that "row" and "column" objects are both mapped into a Hilbert space within which matrix entries correspond to scalar products and the classification or regression function is constructed.
Assume that each matrix entry was measured by a device which determines the projection of an object feature vector x_φ onto a direction z_ω in feature space. The value K_ij of such a complex feature z_ω^j for an object x_φ^i is then given by the dot product
    K_ij = ⟨x_φ^i, z_ω^j⟩ .   (6)
In analogy to the index "φ" for x_φ, the index "ω" indicates that we will later assume that the vectors z_ω are images in R^N of a map ω which is induced by either a kernel or a measurement function. A mathematical foundation of the ansatz eq. (6) is given in Appendix A. Assume that we have P devices whose corresponding vectors z_ω^j are the only measurable and accessible directions, and let Z_ω := (z_ω^1, z_ω^2, ..., z_ω^P) be the matrix of all complex features. Then we can summarize our (incomplete) knowledge about the objects in X_φ using the data matrix
    K = X_φ^⊤ Z_ω .   (7)

For DNA microarray data, for example, we could identify K with the matrix of expression values obtained by a microarray experiment. For web data we could identify K with the matrix of ingoing or outgoing hyperlinks. For a document data set we could identify K with the matrix of word frequencies. Hence we assume that x_φ and z_ω live in a space of hidden causes which are responsible for the different attributes of the objects. The complex features z_ω^j span a subspace of the original feature space, but we do not require them to be orthogonal, normalized, or linearly independent. If we set z_ω^j = e^j (the jth Cartesian unit vector), that is Z_ω = I, K_ij = (x_φ^i)_j and P = N, the "new" description, eq. (7), is fully equivalent to the "old" description using the original feature vectors x_φ. We now define the quality measure for the performance of the classifier or the regression function on the training set. We consider the quadratic loss function

    c(y_i, f(x_φ^i)) = (1/2) r_i² ,   r_i = f(x_φ^i) − y_i = ⟨w, x_φ^i⟩ + b − y_i .   (8)

The mean squared error is
    R_emp[f_{w,b}] = (1/L) Σ_{i=1}^{L} c(y_i, f(x_φ^i)) .   (9)

We now require that the selected classification or regression function minimizes R_emp, i.e. that
    ∇_w R_emp[f_{w,b}] = (1/L) X_φ (X_φ^⊤ w + b 1 − y) = 0   (10)

and

    ∂R_emp[f]/∂b = (1/L) Σ_i r_i = b + (1/L) Σ_i (⟨w, x_φ^i⟩ − y_i) = 0 ,   (11)

where the labels for all objects in the training set are summarized by a label vector y. Since the quadratic loss function is convex in w and b, only one minimum exists if X_φ X_φ^⊤ has full rank. If X_φ X_φ^⊤ is singular, then all points of minimal value correspond to a subspace of R^N. From eq. (11) we obtain
    b = −(1/L) Σ_{i=1}^{L} (⟨w, x_φ^i⟩ − y_i) = −(1/L) (w^⊤ X_φ − y^⊤) 1 .   (12)

Condition eq. (10) implies that the directional derivative should be zero along any direction in feature space, including the directions of the complex feature vectors z_ω. We therefore obtain
    dR_emp[f_{w + t z_ω^j, b}]/dt = (z_ω^j)^⊤ ∇_w R_emp[f_{w,b}]   (13)
                                  = (1/L) (z_ω^j)^⊤ X_φ (X_φ^⊤ w + b 1 − y) = 0 ,

and, combining all complex features,

    (1/L) Z_ω^⊤ X_φ (X_φ^⊤ w + b 1 − y) = (1/L) K^⊤ r = σ = 0 .   (14)

Hence we require that for every complex feature z_ω^j the mixed moments σ_j between the residual error r_i and the measured values K_ij should be zero.
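The constraints of eq. (14) hold automatically at any minimum of the quadratic loss, because there X_φ (X_φ^⊤ w + b 1 − y) = 0 and K = X_φ^⊤ Z_ω. A small numerical check with random toy matrices (all sizes and values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
N, L, P = 5, 20, 3
X = rng.normal(size=(N, L))      # columns: feature vectors x^i
Z = rng.normal(size=(N, P))      # columns: complex features z^j
K = X.T @ Z                      # data matrix, eq. (7)
y = rng.choice([-1.0, 1.0], size=L)

# Minimize R_emp = (1/2L) ||X^T w + b 1 - y||^2 over (w, b) by least squares
A = np.hstack([X.T, np.ones((L, 1))])
sol, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = sol[:N], sol[-1]

r = X.T @ w + b - y              # residuals
sigma = K.T @ r / L              # mixed moments, eq. (14)
print(np.allclose(sigma, 0.0))   # True: the constraints hold at the minimum
```

The normal equations of the least-squares fit state exactly that X_φ r = 0 and 1^⊤ r = 0, from which K^⊤ r = Z_ω^⊤ X_φ r = 0 follows for any choice of complex features.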
2.3 The Potential Support Vector Machine (P-SVM)

2.3.1 The Basic P-SVM

The new objective from eq. (4) and the new constraints from eq. (14) constitute a new procedure for selecting a classifier or a regression function. The number of constraints is equal to the number P of complex features, which can be larger or smaller than the number L of data points or the dimension N of the original feature space. Because the mean squared error of a linear function f_{w,b} is a convex function of the parameters w and b, the constraints enforce that R_emp is minimal². If K has at least rank L (the number of training examples), then the minimum even corresponds to r = 0 (cf. eq. (14)). Consequently, overfitting occurs and a regularization scheme is needed. Before a regularization scheme can be defined, however, the mixed moments σ_j must be normalized. The reason is that high values of σ_j may either be the result of a high variance of the values of K_ij or the result of a high correlation between the residual error r_i and the values of K_ij. Since we are interested in the latter, the most appropriate measure would be Pearson's correlation coefficient

    σ̂_j = Σ_{i=1}^{L} (K_ij − K̄_j)(r_i − r̄) / ( √(Σ_{i=1}^{L} (K_ij − K̄_j)²) √(Σ_{i=1}^{L} (r_i − r̄)²) ) ,   (15)

² Note that w = ("X_φ^⊤")† (y − b 1) fulfills the constraints, where ("X_φ^⊤")† is the pseudo-inverse of X_φ^⊤.
where r̄ = (1/L) Σ_{i=1}^{L} r_i and K̄_j = (1/L) Σ_{i=1}^{L} K_ij. If every column vector (K_1j, K_2j, ..., K_Lj)^⊤ of the data matrix K is normalized to mean zero and variance one, we obtain
    σ_j = (1/L) Σ_{i=1}^{L} K_ij r_i = σ̂_j (1/√L) ||r − r̄ 1||₂ .   (16)

Because Σ_i r_i = 0 (eq. (11)), the mixed moments σ_j are now proportional to the correlation coefficient σ̂_j with a proportionality constant which is independent of the complex feature z_ω^j, and σ_j can be used instead of σ̂_j to formulate the constraints. If the data vectors are normalized, the term K^⊤ 1 vanishes and we obtain the basic P-SVM optimization problem
Basic P-SVM optimization problem

    min_w  (1/2) ||X_φ^⊤ w||²                     (17)
    s.t.   K^⊤ (X_φ^⊤ w − y) = 0 .

The offset b of the classification or regression function is given by eq. (12), which simplifies after normalization to (see Hochreiter and Obermayer, 2004a)
    b = (1/L) Σ_{i=1}^{L} y_i .   (18)

We will call this model selection procedure the Potential Support Vector Machine (P-SVM), and we will always assume normalized data vectors in the following.
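The proportionality of eq. (16) can be verified numerically: after normalizing the column vectors of K to mean zero and variance one, the mixed moments differ from Pearson's correlation coefficients only by the factor ||r − r̄ 1||₂ / √L. A sketch with random toy values:

```python
import numpy as np

rng = np.random.default_rng(2)
L, P = 30, 4
K = rng.normal(size=(L, P))
K = (K - K.mean(axis=0)) / K.std(axis=0)   # columns: mean zero, variance one

r = rng.normal(size=L)
r = r - r.mean()                           # sum of residuals is zero, eq. (11)

sigma = K.T @ r / L                        # mixed moments
pearson = np.array([np.corrcoef(K[:, j], r)[0, 1] for j in range(P)])

# sigma_j = pearson_j * ||r - mean(r) 1||_2 / sqrt(L), eq. (16)
print(np.allclose(sigma, pearson * np.linalg.norm(r) / np.sqrt(L)))  # True
```

Since the factor is the same for every j, thresholding the mixed moments is equivalent to thresholding the correlation coefficients, which is what the regularization schemes below exploit.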
2.3.2 The Kernel Trick

Following the standard "support vector" way to derive learning rules for perceptrons, we have so far considered linear classifiers only. Most practical classification problems, however, require non-linear classification boundaries, which makes the construction of a proper feature space necessary. In analogy to the standard SVM, we now invoke the kernel trick. Let x^i and z^j be feature vectors which describe the "column" and the "row" objects of the dataset. We then choose a kernel function k(x^i, z^j) and compute the matrix K of relations between "column" and "row" objects:

    K_ij = k(x^i, z^j) = ⟨φ(x^i), ω(z^j)⟩ = ⟨x_φ^i, z_ω^j⟩ ,   (19)
where x_φ^i = φ(x^i) and z_ω^j = ω(z^j). In Appendix A it is shown that any L₂-kernel corresponds (for almost all points) to a dot product in a Hilbert space in the sense of eq. (19) and induces a mapping into a feature space within
which a linear classifier can then be constructed. In the following chapters we will, therefore, distinguish between the actual measurements x^i and z^j and the feature vectors x_φ^i and z_ω^j "induced" by the kernel k. Potential choices for "row" objects and their vectorial description are: z^j = x^j, P = L (standard construction of a Gram matrix); z^j = e^j, P = N ("elementary" features); z^j is the jth cluster center of a clustering algorithm applied to the vectors x^i (an example of a "complex" feature); or z^j is the jth vector of a PCA or ICA preprocessing (another example of a "complex" feature). If the entries K_ij of the data matrix are directly measured, the application of the kernel trick needs additional considerations. In Appendix A we show that, if the measurement process can be expressed through a kernel k(x^i, z^j) which takes a column object x^i and a row object z^j and outputs a number, the matrix K of relations between the "row" and "column" objects can be interpreted as a dot product

    K_ij = ⟨x_φ^i, z_ω^j⟩   (20)

in some feature space, where x_φ^i = φ(x^i) and z_ω^j = ω(z^j). Note that we distinguish between an object x^i and its associated feature vectors x^i or x_φ^i, leading to differences in the definition of k for the cases of vectorial and (measured) dyadic data. Eq. (20) justifies the P-SVM approach, which was derived for the case of linear predictors, also for measured data.
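The choices of "row" objects listed above can be made concrete in a few lines. All data below are toy values, and the "cluster centers" are simply the means of two arbitrary halves of the data, standing in for the output of a real clustering algorithm:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(2, 10))      # columns: vectors x^i (N = 2, L = 10)
rbf = lambda a, b: np.exp(-0.5 * np.sum((a - b) ** 2))

def relation_matrix(X, Z, k):
    """K[i, j] = k(x^i, z^j) for column objects X and row objects Z."""
    return np.array([[k(x, z) for z in Z.T] for x in X.T])

# 1) z^j = x^j: the standard L x L Gram matrix (pairwise data)
K_gram = relation_matrix(X, X, rbf)

# 2) z^j = e^j: "elementary" features; with a linear kernel K equals X^T
K_elem = relation_matrix(X, np.eye(2), lambda a, b: a @ b)

# 3) z^j = cluster centers ("complex" features); here just means of two
#    arbitrary halves of the data, a stand-in for a clustering algorithm
Z_centers = np.stack([X[:, :5].mean(axis=1), X[:, 5:].mean(axis=1)], axis=1)
K_clust = relation_matrix(X, Z_centers, rbf)

print(K_gram.shape, K_elem.shape, K_clust.shape)  # (10, 10) (10, 2) (10, 2)
```

All three matrices are legitimate inputs to the P-SVM; only the first is square, and none of them needs to be positive semidefinite.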
2.3.3 The P-SVM for Classification

If the P-SVM is used for classification, we suggest a regularization scheme based on slack variables ξ⁺ and ξ⁻. Slack variables allow for small violations of individual constraints if the correct choice of w would otherwise lead to a large increase of the objective function. We obtain
    min_{w,ξ⁺,ξ⁻}  (1/2) ||X_φ^⊤ w||² + C 1^⊤ (ξ⁺ + ξ⁻)     (21)
    s.t.  K^⊤ (X_φ^⊤ w − y) + ξ⁺ ≥ 0
          K^⊤ (X_φ^⊤ w − y) − ξ⁻ ≤ 0
          0 ≤ ξ⁺, ξ⁻

for the primal problem. The above regularization scheme makes the optimization problem robust against "outliers". A large value of a slack variable indicates that the particular "row" object only weakly influences the direction of the classification boundary, because it would otherwise considerably increase the value of the complexity term. This happens in particular for high levels of measurement noise, which lead to large, spurious values of the mixed moments σ_j. If the noise is large, the value of C must be small to "remove" the corresponding constraints via the slack variables ξ. If the strength of the measurement noise is known, the correct value of C can be determined a priori. Otherwise, it must be determined using model selection techniques.
To derive the dual optimization problem, we evaluate the Lagrangian L,

    L = (1/2) w^⊤ X_φ X_φ^⊤ w + C 1^⊤ (ξ⁺ + ξ⁻)             (22)
        − (α⁺)^⊤ (K^⊤ (X_φ^⊤ w − y) + ξ⁺)
        + (α⁻)^⊤ (K^⊤ (X_φ^⊤ w − y) − ξ⁻)
        − (μ⁺)^⊤ ξ⁺ − (μ⁻)^⊤ ξ⁻ ,                           (23)

where the vectors α⁺ ≥ 0, α⁻ ≥ 0, μ⁺ ≥ 0, and μ⁻ ≥ 0 are the Lagrange multipliers for the constraints in eqs. (21). The optimality conditions require that
    ∇_w L = X_φ X_φ^⊤ w − X_φ K α                           (24)
          = X_φ X_φ^⊤ w − X_φ X_φ^⊤ Z_ω α = 0 ,

where α = α⁺ − α⁻ (α_i = α_i⁺ − α_i⁻). Eq. (24) is fulfilled if

    w = Z_ω α .   (25)
In contrast to the standard SVM expansion of w into its support vectors x_φ, the weight vector w is now expanded into a set of complex features z_ω, which we will call "support features" in the following. We then arrive at the dual optimization problem:
P-SVM optimization problem for classification

    min_α  (1/2) α^⊤ K^⊤ K α − y^⊤ K α     (26)
    s.t.   −C 1 ≤ α ≤ C 1 ,

which now depends on the data via the kernel or data matrix K only. The dual problem can be solved by a Sequential Minimal Optimization (SMO) technique, which is described in (Hochreiter and Obermayer, 2004a). The SMO technique is essential if many complex features are used, because the P × P matrix K^⊤ K enters the dual formulation. Finally, the classification function f has to be constructed using the optimal values of the Lagrange parameters α:
P-SVM classification function

    f(x) = Σ_{j=1}^{P} α_j K(x)_j + b ,   (27)

where the expansion eq. (25) has been used for the weight vector w and b is given by eq. (18).
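Before turning to the interpretation of the coefficients, here is a minimal numerical sketch of the whole pipeline on toy data. The projected-gradient loop is only an illustrative stand-in for the SMO solver of Hochreiter and Obermayer (2004a), and the data, kernel width, and C are invented:

```python
import numpy as np

def psvm_dual(K, y, C, n_steps=5000):
    """Minimize 0.5 a^T K^T K a - y^T K a  s.t. -C <= a <= C  (eq. (26))
    by projected gradient descent, an illustrative stand-in for SMO."""
    H, g = K.T @ K, K.T @ y
    eta = 1.0 / np.linalg.norm(H, 2)   # step size below 2 / lambda_max
    a = np.zeros(K.shape[1])
    for _ in range(n_steps):
        a = np.clip(a - eta * (H @ a - g), -C, C)  # gradient step + box projection
    return a

rng = np.random.default_rng(4)
L = 40
x = np.vstack([rng.normal(-1.0, 0.5, size=(L // 2, 2)),    # class -1
               rng.normal(+1.0, 0.5, size=(L // 2, 2))])   # class +1
y = np.array([-1.0] * (L // 2) + [1.0] * (L // 2))

# Pairwise data: RBF Gram matrix (z^j = x^j), column vectors normalized
d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * d2)
K = (K - K.mean(0)) / K.std(0)

alpha = psvm_dual(K, y, C=10.0)
b = y.mean()                       # eq. (18)
pred = np.sign(K @ alpha + b)      # eq. (27) evaluated on the training set
print("training accuracy:", (pred == y).mean())
```

Projected gradient converges far more slowly than SMO on ill-conditioned K^⊤K; it is used here only because it fits in a few lines.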
The classifier based on eq. (27) depends on the coefficients α_j, which were determined during optimization, on b, which can be computed directly, and on the measured values K(x)_j for the new object x. The coefficients α_j = α_j⁺ − α_j⁻ can be interpreted as class indicators, because they separate the complex features into features which are relevant for class 1 and class −1, according to the sign of α_j. Note that if we consider the Lagrange parameters α_j as parameters of the classifier, we find that
    dR_emp[f_{w + t z_ω^j, b}]/dt = σ_j = ∂R_emp[f]/∂α_j .
The directional derivatives of the empirical error R_emp along the complex features in the primal formulation correspond to its partial derivatives with respect to the corresponding Lagrange parameters in the dual formulation. One of the most crucial properties of the P-SVM procedure is that the dual optimization problem depends on K only via K^⊤ K. Therefore, K is neither required to be positive semidefinite nor to be square. This allows not only the construction of SVM-based classifiers for matrices K of general shape but also the extension of SVM-based approaches to the new class of indefinite kernels operating on the objects' feature vectors. In the following we illustrate the application of the P-SVM approach to classification using a toy example. The data set consists of 70 objects x, 28 from class 1 and 42 from class 2, which are described by 2D feature vectors x (open and solid circles in Fig. 4). Two pairwise data sets were then generated by applying a standard RBF kernel and the (indefinite) sine kernel k(x^i, x^j) = sin(θ ||x^i − x^j||), which leads to an indefinite Gram matrix. Fig. 4 shows the classification result. The sine kernel is more appropriate than the RBF kernel for this data set because it is better adjusted to the "oscillatory" regions of class membership, leading to a smaller number of support vectors and to a smaller test error. In general, the value of θ has to be selected using standard model selection techniques. A large value of θ leads to a more "complex" set of classifiers, reduces the classification error on the training set, but increases the error on the test set. Non-Mercer kernels extend the range of kernels which are currently used and, therefore, open up a new direction of research for kernel design.
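The key point, that the dual (26) remains convex even for an indefinite kernel because only K^⊤K enters, can be illustrated directly with the sine kernel from the toy example (the 1-D sample points and θ = 25 below are chosen arbitrarily):

```python
import numpy as np

x = np.linspace(0.0, 1.2, 15)                        # toy 1-D "column" objects
theta = 25.0
K = np.sin(theta * np.abs(x[:, None] - x[None, :]))  # indefinite sine kernel

eig_K = np.linalg.eigvalsh(K)
eig_KtK = np.linalg.eigvalsh(K.T @ K)

print(eig_K.min() < 0)          # True: K itself is not positive semidefinite
print(eig_KtK.min() > -1e-8)    # True: K^T K is, so the dual stays convex
```

Since K has a zero diagonal (sin(0) = 0) and is not the zero matrix, its trace is zero and it must have negative eigenvalues, while K^⊤K is positive semidefinite by construction.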
2.3.4 The P-SVM for Regression

The new objective function of eq. (4) was motivated for a classification problem, but it can also be used to find an optimal regression function in a regression task. In regression the term ||X_φ^⊤ w||² = ||X_φ^⊤ ŵ||² ||w||², ŵ := w/||w||, is the product of a term which expresses the deviation of the data from the zero-isosurface of the regression function and a term which corresponds to the smoothness of the regressor. The smoothness of the regression function is determined by the norm of the weight vector w. If f(x_φ^i) = ⟨w, x_φ^i⟩ + b and the

[Figure 4 panels: left - kernel: RBF, SV: 50, σ: 0.1, C: 100; right - kernel: sine, SV: 7, θ: 25, C: 100]
Figure 4: Application of the P-SVM method to a toy classification problem. Objects are described by 2D feature vectors x. 70 feature vectors were generated, 28 for class 1 (open circles) and 42 for class 2 (solid circles). A Gram matrix was constructed using the kernels "RBF" k(x^i, x^j) = exp(−(1/(2σ²)) ||x^i − x^j||²) (left) and "sine" k(x^i, x^j) = sin(θ ||x^i − x^j||) (right). White and gray indicate regions of class 1 and class 2 membership. Circled data indicate support vectors ("SV"). For parameters see figure.
length of the vectors x_φ is bounded by B, then the deviation of f from the offset b is bounded by

    |f(x_φ^i) − b| = |⟨w, x_φ^i⟩| ≤ ||w|| ||x_φ^i|| ≤ ||w|| B ,   (28)

where the first "≤" follows from the Cauchy-Schwarz inequality; hence a smaller value of ||w|| leads to a smoother regression function. This tradeoff between smoothness and residual error is also reflected by eq. (59) in Appendix A, which shows that eq. (4) is the L₂-norm of the function f. The constraints of vanishing mixed moments carry over to regression problems with the only modification that the target values y_i in eqs. (21) are real rather than binary numbers. The constraints are even more "natural" for regression because the r_i are indeed the residuals a regression function should minimize. We, therefore, propose to use the primal optimization problem, eqs. (21), and its corresponding dual, eqs. (26), also for the regression setting. Fig. 5 shows the application of the P-SVM to a toy regression example (pairwise data). 50 data points are randomly chosen from the true function (dashed line) and i.i.d. Gaussian noise with mean 0 and standard deviation 0.2 is added to each y-component. One outlier was added at x = 0. The figure shows the P-SVM regression results (solid lines) for an RBF kernel and three different combinations of C and σ. The hyperparameter C controls the residual error: a smaller value of C increases the residual error at x = 0 but also the number of support vectors. The width σ of the RBF kernel controls the overall smoothness of the regressor: a larger value of σ increases the error at x = 0 without increasing the number of support vectors (arrows in bold in Fig. 5).
[Figure 5: three panels; top left: SV: 7, σ: 2, C: 20; top right: SV: 5, σ: 4, C: 20; bottom: SV: 14, σ: 2, C: 0.6; panel labels mark x = −2.]
Figure 5: Application of the P-SVM method to a toy regression problem. Objects (small dots), described by the x-coordinate, were generated by randomly choosing a value from [−10, 10] and calculating the y-value from the true function (dashed line) plus Gaussian noise N(0, 0.2). One outlier was added at x = 0 (thin arrows). A Gram matrix was generated using an RBF kernel, k(x^i, x^j) = exp(−‖x^i − x^j‖² / (2σ²)). The solid lines show the regression result. Circled dots indicate support vectors ("SV"). For parameters see figure. Bold arrows mark x = −2.
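As an illustration, the toy setup of Fig. 5 can be sketched as follows. The true function is not specified in the text, so `target` below is a hypothetical stand-in; only the sampling interval, noise level, and RBF kernel follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

# 50 x-values drawn uniformly from [-10, 10], as in Fig. 5
L = 50
x = np.sort(rng.uniform(-10.0, 10.0, size=L))

def target(x):
    # hypothetical stand-in for the (unspecified) true function
    return np.sinc(x / np.pi)

# y-values: true function plus Gaussian noise N(0, 0.2)
y = target(x) + rng.normal(0.0, 0.2, size=L)

def rbf_gram(a, b, sigma):
    """Gram matrix K_ij = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

K = rbf_gram(x, x, sigma=2.0)  # sigma as in the upper-left panel
```

A larger σ makes the rows of K flatter, which is the smoothing effect discussed above: the regressor averages over a wider neighborhood.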
2.3.5 The P-SVM for Feature Selection

In this section we modify the P-SVM method for feature selection such that it can serve as a data preprocessing method in order to improve the generalization performance of subsequent classification or regression tasks (see also Hochreiter and Obermayer, 2004b). Because the P-SVM expands w into a sparse set of support features, it can be modified to optimally extract a small number of "informative" features from the set of "row" objects. The set of "support features" can then be used as input to an arbitrary predictor, e.g. another P-SVM, a standard SVM, or a K-nearest-neighbor classifier. Noisy measurements can lead to spurious mixed moments, i.e. complex features may contain no information about the objects' attributes but still exhibit finite values of α_j. In order to prevent those features from affecting the classification
boundary or the regression function, we introduce a "correlation threshold" ε and modify the constraints in eqs. (17) according to
  K^T (X^T w − y) − ε 1 ≤ 0 ,   (29)
  K^T (X^T w − y) + ε 1 ≥ 0 .

This regularization scheme is analogous to the ε-insensitive loss (Schölkopf and Smola, 2002). Absolute values of mixed moments smaller than ε are considered to be spurious; the corresponding features do not influence the weight vector, because the constraints remain fulfilled. The value of ε directly correlates with the strength of the measurement noise and can be determined a priori if it is known. If this is not the case, ε serves as a hyperparameter and its value can be determined using model selection techniques. Note that data vectors have to be normalized (see Subsection 2.3.1) before applying the P-SVM, because otherwise a global value of ε would not suffice. Combining eq. (4) and eqs. (29) we then obtain the primal optimization problem

  min_w  (1/2) ‖X^T w‖²   (30)
  s.t.  K^T (X^T w − y) + ε 1 ≥ 0
        K^T (X^T w − y) − ε 1 ≤ 0

for P-SVM feature selection. In order to derive the dual formulation we have to evaluate the Lagrangian:
  L = (1/2) w^T X X^T w − α_+^T ( K^T (X^T w − y) + ε 1 )   (31)
      + α_−^T ( K^T (X^T w − y) − ε 1 ) ,

where we have used the notation from Section 2.3.3. The vector w is again expressed through the complex features,
  w = Z ω ,   (32)

and we obtain the dual formulation of eq. (30):
P-SVM optimization problem for feature selection
  min_{α_+, α_−}  (1/2) (α_+ − α_−)^T K^T K (α_+ − α_−) − y^T K (α_+ − α_−) + ε 1^T (α_+ + α_−)   (33)
  s.t.  0 ≤ α_+ ,  0 ≤ α_− .
The term ε 1^T (α_+ + α_−) in this dual objective function enforces a sparse expansion of the weight vector w in terms of the support features. This occurs,
because for large enough values of ε, this term forces all α_j towards zero except for the complex features which are most relevant for classification or regression. If K^T K is singular and w is not uniquely determined, ε enforces a unique solution, which is characterized by the most sparse representation through complex features. The dual problem is again solved by a fast Sequential Minimal Optimization (SMO) technique (see Hochreiter and Obermayer, 2004a). Finally, let us address the relationship between the value of a Lagrange multiplier α_j and the "importance" of the corresponding complex feature z_ω^j for prediction. The change of the empirical error under a change of the weight vector by an amount α_j along the direction of a complex feature z_ω^j is given by
  R_emp[f_{w + α_j z_ω^j, b}] − R_emp[f_{w, b}]   (34)
    = (α_j / L) Σ_i K_ij r^i + (α_j² / (2L)) Σ_i K_ij²  ≤  |α_j| ε / L + α_j² / 2 ,

because the constraints eq. (29) ensure that |Σ_i K_ij r^i| ≤ ε, and the normalization of Subsection 2.3.1 implies Σ_i K_ij² = L. If a complex feature z_ω^j is completely removed, then α_j = −ω_j and

  R_emp[f_{w − ω_j z_ω^j, b}] − R_emp[f_{w, b}]  ≤  |ω_j| ε / L + ω_j² / 2 .   (35)

The Lagrange parameter α_j is thus directly related to the increase in the empirical error when a feature is removed. Therefore, α serves as an importance measure for the complex features and allows features to be ranked according to the absolute values of its components. In the following, we illustrate the application of the P-SVM approach to feature selection using a classification toy example. The data set consists of 50 "column" objects, 25 from each class, which are described by 2D-feature vectors x (open and solid circles in Fig. 6). 50 "row" objects were randomly selected by choosing their 2D-feature vectors z according to a uniform distribution on [−1.2, 1.2] × [−1.2, 1.2]. K was generated using an RBF kernel, k(x^i, z^j) = exp(−‖x^i − z^j‖² / (2σ²)), with σ = 0.2. Fig. 6 shows the result of the P-SVM feature selection method with a correlation threshold ε = 20. Only six features were selected (crosses in Fig. 6), but every group of data points ("column" objects) is described (and detected) by one or two feature vectors ("row" objects). The number of selected features depends on ε and on σ, which determines how the "strength" of a complex feature decreases with the distance ‖x^i − z^j‖. Smaller ε or larger σ would result in a larger number of support features.
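The dual (33) is a convex quadratic program over the positive orthant, so even a plain projected-gradient loop can solve it. The following sketch (not the SMO solver used in the paper) illustrates how ε produces a sparse ω = α_+ − α_− and how |ω_j| can be used to rank features; the toy data and the value of ε are made up for illustration.

```python
import numpy as np

def psvm_feature_selection(K, y, eps, iters=20000):
    """Projected gradient descent on the dual (33):
    min 1/2 (a+ - a-)' K'K (a+ - a-) - y'K (a+ - a-) + eps 1'(a+ + a-)
    s.t. a+ >= 0, a- >= 0.  Returns omega = a+ - a-."""
    KtK = K.T @ K
    Kty = K.T @ y
    lr = 0.5 / np.linalg.norm(KtK, 2)  # conservative step size
    ap = np.zeros(K.shape[1])
    am = np.zeros(K.shape[1])
    for _ in range(iters):
        g = KtK @ (ap - am) - Kty       # = K'(X'w - y) at w = Z(a+ - a-)
        ap = np.maximum(0.0, ap - lr * (g + eps))
        am = np.maximum(0.0, am - lr * (-g + eps))
    return ap - am

# toy data: labels depend (noisily) on the first of 10 candidate features
rng = np.random.default_rng(0)
K = rng.standard_normal((100, 10))
y = np.sign(K[:, 0] + 0.1 * rng.standard_normal(100))
omega = psvm_feature_selection(K, y, eps=40.0)
ranking = np.argsort(-np.abs(omega))  # support features first
```

At the optimum the iterate satisfies the ε-insensitive constraints (29), i.e. |K^T(Kω − y)| ≤ ε componentwise, and most components of ω are exactly zero; increasing `eps` shrinks the support-feature set further.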
2.4 The Dot Product Interpretation of Dyadic Data

The derivation of the P-SVM method is based on the assumption that the matrix K is a dot product matrix whose elements denote a scalar product between the feature vectors which describe the "row" and the "column" objects. If K, however, is a matrix of measured values, the question arises under which conditions such a matrix can be interpreted as a dot product matrix. In Appendix A we show that the above question reduces to
Figure 6: P-SVM method applied to a toy feature selection problem. "Column" and "row" objects are described by 2D-feature vectors x and z. 50 "column" feature vectors, 25 from each class (open / solid circles), were chosen from N((±1, ±1), 0.1 I), where the centers are chosen with equal probability. The "row" feature vectors were chosen uniformly from [−1.2, 1.2]². K is constructed by an RBF kernel, exp(−‖x^i − z^j‖² / (2 · 0.2²)). Black crosses indicate P-SVM support features (SV: 6, σ: 0.2, ε: 20).
(1) Can the set X of "column" objects x be completed to a measure space?
(2) Can the set Z of "row" objects z be completed to a measure space?
(3) Can the measurement process be expressed via the evaluation of a measurable kernel k(x, z) which is from L²(X, Z)?

(1) and (2) can easily be fulfilled by defining a suitable σ-algebra on the sets; (3) holds if k is bounded and the sets X and Z are compact. Condition (3) equates the evaluation of a kernel as known from standard SVMs with physical measurements, and the physical characteristics of the measurement device determine the properties of the kernel, e.g. boundedness and continuity. But can a measurement process indeed be expressed through a kernel? There is no full answer to this question from a theoretical viewpoint; practical applications have to confirm (or disprove) the chosen ansatz and data model.
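One practical consequence is worth making concrete: whatever the measurement process is, the matrix K^T K appearing in the P-SVM dual is a Gram matrix and therefore always positive semidefinite, even when K itself is rectangular or indefinite, so the dual quadratic program is always well defined. A minimal numerical check (the matrix values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# a hypothetical "measurement" matrix: 6 column objects, 4 row objects;
# it is neither square nor symmetric, and no kernel was used to build it
K = rng.standard_normal((6, 4))

Q = K.T @ K  # the quadratic term of the P-SVM dual
eigenvalues = np.linalg.eigvalsh(Q)

# Q = K'K is a Gram matrix, hence symmetric positive semidefinite
assert np.allclose(Q, Q.T)
assert np.all(eigenvalues >= -1e-10)
```

This is the formal reason why the P-SVM, unlike standard SVMs, does not require the measured matrix itself to be positive definite or even square.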
3 Numerical Experiments and Applications
In this section we apply the P-SVM method to various kinds of real-world data sets and provide benchmark results with previously proposed methods when appropriate. This section consists of three parts, which cover classification, regression, and feature selection. In part one the P-SVM is first tested as a classifier on data sets from the UCI Benchmark Repository, and its performance is compared with results obtained with C- and ν-SVMs for different kernels. Secondly, we apply the P-SVM to a measured (rather than constructed) pairwise data set ("protein") and a genuine dyadic data set ("World Wide Web"). In part two the P-SVM is applied to regression problems taken from the UCI benchmark repository and compared with results obtained with C-Support Vector Regression and Bayesian SVMs. Part three describes results obtained for the P-SVM as a feature selection method for microarray data and for the "protein" and "World Wide Web" data sets from part one.
3.1 Application to Classification Problems

3.1.1 UCI Data Sets

In this section we report benchmark results for the data sets "thyroid" (5 features), "heart" (13 features), "breast-cancer" (9 features), and "german" (20 features) from the UCI benchmark repository, and for the data set "banana" (2 features) taken from (Rätsch et al., 2001). All data sets were preprocessed as described in (Rätsch et al., 2001) and divided into 100 training/test set pairs. Data sets were generated through resampling, where data points were randomly selected for the training set and the remaining data was used for the test set. We downloaded the original 100 training/test set pairs from http://ida.first.fraunhofer.de/projects/bench/. For the data sets "german" and "banana" we restricted the training set to the first 200 examples of the original training set in order to keep hyperparameter selection feasible, but used the original test sets. Pairwise data sets were generated by constructing the Gram matrix for radial basis function (RBF), polynomial (POL), and Plummer (PLU, see Hochreiter et al., 2003) kernels, and the Gram matrices were used as input for kernel Fisher discriminant analysis (KFD, Mika et al., 1999), C-, ν-, and P-SVM. Because KFD only selects a direction in input space onto which all data points are projected, we must select a classifier for the resulting one-dimensional classification task. We follow (Schölkopf and Smola, 2002) and classify a data point according to its posterior probability under the assumption of a Gaussian distribution for each label. Hyperparameters (C, ν, and kernel parameters) were optimized using 5-fold cross validation on the corresponding training sets. To ensure a fair comparison, the hyperparameter selection procedure was equal for all methods, except that the values of ν for the ν-SVM were selected from {0.1, . . . , 0.9}, in contrast to the selection of C, for which a logarithmic scale was used.
To test the significance of the differences in generalization performance (percent of misclassifications), we calculated for what percentage of the individual runs (for a total of 100, see below) our method was significantly better or worse than the others. In order to do so, we first performed a test for the "difference of two proportions" for each training/test set pair (Dietterich, 1998). The "difference of two proportions" is the difference of the misclassification rates of two models on the test set, where the models are selected on the training set by the two methods which are to be compared. After this test we adjusted the false discovery rate through the "Benjamini-Hochberg procedure" (Benjamini and Hochberg, 1995), which was recently shown to be correct for dependent outcomes of the tests (Benjamini and Yekutieli, 2001). The fact that the tests can be dependent is important, because training and test sets overlap for the different training/test set pairs. The false discovery rate was controlled at 5 %. We counted, for each pair of methods, the selected models from the first method which perform significantly (5 % level) better than the selected models from the second method.
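The Benjamini-Hochberg step-up procedure itself is simple to state: sort the m p-values, find the largest k with p_(k) ≤ (k/m)q, and reject the k hypotheses with the smallest p-values. A sketch follows; the p-values are the worked example from Benjamini and Hochberg (1995), not values from the experiments reported here.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Step-up FDR control: reject the k smallest p-values, where k is
    the largest index with p_(k) <= (k/m) * q."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest index passing the test
        reject[order[: k + 1]] = True
    return reject

# p-values from the worked example in Benjamini and Hochberg (1995)
pvals = [0.0001, 0.0004, 0.0019, 0.0095, 0.0201, 0.0278, 0.0298,
         0.0344, 0.0459, 0.3240, 0.4262, 0.5719, 0.6528, 0.7590, 1.0000]
reject = benjamini_hochberg(pvals, q=0.05)  # rejects the 4 smallest
```

Note that the procedure rejects all hypotheses up to the largest passing index, including intermediate ones whose p-values exceed their individual thresholds.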
Table 1 summarizes the percentage of misclassification averaged over 100
                    RBF            POL            PLU
Thyroid
  C-SVM          5.11 (2.34)    4.51 (2.02)    4.96 (2.35)
  ν-SVM          5.15 (2.23)    4.04 (2.12)    4.83 (2.03)
  KFD            4.96 (2.24)    6.52 (3.18)    5.00 (2.26)
  P-SVM          4.71 (2.06)    9.44 (3.12)    5.08 (2.18)
Heart
  C-SVM         16.67 (3.51)   18.26 (3.50)   16.33 (3.47)
  ν-SVM         16.87 (3.87)   17.44 (3.90)   18.47 (7.81)
  KFD           17.82 (3.45)   22.53 (3.37)   17.80 (3.86)
  P-SVM         16.18 (3.66)   16.67 (3.40)   16.54 (3.64)
Breast-Cancer
  C-SVM         26.94 (5.18)   26.87 (4.79)   26.48 (4.87)
  ν-SVM         27.69 (5.62)   26.69 (4.73)   30.16 (7.83)
  KFD           27.53 (4.19)   31.30 (6.11)   27.19 (4.72)
  P-SVM         26.78 (4.58)   26.40 (4.54)   26.66 (4.59)
Banana
  C-SVM         11.88 (1.11)   12.09 (0.96)   11.81 (1.07)
  ν-SVM         11.67 (0.90)   12.72 (2.16)   11.74 (0.98)
  KFD           12.32 (1.12)   14.04 (3.89)   12.14 (0.96)
  P-SVM         11.59 (0.96)   14.93 (2.09)   11.52 (0.93)
German
  C-SVM         26.51 (2.60)   27.27 (3.23)   26.88 (3.12)
  ν-SVM         27.14 (2.84)   27.13 (3.06)   28.60 (6.27)
  KFD           26.58 (2.95)   30.96 (3.23)   26.90 (3.15)
  P-SVM         26.45 (3.20)   25.87 (2.45)   26.65 (2.95)
Table 1: Average percentage of misclassification for the UCI and the "banana" data sets. The table compares results obtained with kernel Fisher discriminant analysis (KFD), C-, ν-, and P-SVM for the radial basis function (RBF), exp(−‖x^i − x^j‖² / (2σ²)), polynomial (POL), (⟨x^i, x^j⟩ + β)^δ, and Plummer (PLU), (‖x^i − x^j‖ + β)^(−δ), kernels. Results were averaged over 100 experiments with separate training and test sets. For each data set, numbers in bold and italic highlight the best and the second best result; the numbers in brackets denote standard deviations of the trials. C, ν, and kernel parameters were determined using 5-fold cross validation on the training set and usually differed between individual experiments.
experiments. Despite the fact that C- and ν-SVMs are equivalent, results differ because of the somewhat different model selection results for the hyperparameters C and ν. Best and second best results are indicated by bold and italic numbers. The significance tests did not reveal significant differences over the 100 runs in generalization performance for the majority of the cases (for
details see http://ni.cs.tu-berlin.de/publications/psvm_sup). The P-SVM with "RBF" and "PLU" kernels, however, never performed significantly worse in any of the 100 runs of each data set than its competitors. Note that the significance tests do not tell whether the averaged misclassification rates of one method are significantly better/worse than the rates of another method. They provide information on how many runs out of 100 one method was significantly better than another method. The UCI benchmark results show that the P-SVM is competitive with other state-of-the-art methods for a standard problem setting (measurement matrix equivalent to the Gram matrix). Although the P-SVM method never performed significantly worse, it generally required fewer support vectors than other SVM approaches. This was also true for the cases where the P-SVM's performance was significantly better.
3.1.2 Protein Data Set

The "protein" data set (cf. Hofmann and Buhmann, 1997) was provided by M. Vingron and consists of 226 proteins from the globin families. Pairs of proteins are characterized by their evolutionary distance, which is defined as the probability of transforming one amino acid sequence into the other via point mutations. Class labels are provided, which denote membership in one of the four families: hemoglobin-α ("H-α"), hemoglobin-β ("H-β"), myoglobin ("M"), and heterogeneous globins ("GH"). Table 2 summarizes the classification results, which were obtained with the generalized SVM (G-SVM, Graepel et al., 1999; Mangasarian, 1998) and the P-SVM method. Since the G-SVM interprets the columns of the data matrix as feature vectors for the column objects and applies a standard ν-SVM to these vectors (this is also called a "feature map", Schölkopf and Smola, 2002), we call the G-SVM simply "ν-SVM" in the following. The table shows the percentage of misclassification for the four two-class classification problems "one class against the rest". The P-SVM yields better classification results, although a conservative test for significance was not possible due to the small number of data. However, the P-SVM selected 180 proteins as support vectors on average, compared to 203 proteins used by the ν-SVM (note that for 10-fold cross validation 203 is the average training set size). Here a smaller number of support vectors is highly desirable, because it reduces the computational costs of the sequence alignments which are necessary for the classification of new examples.
3.1.3 World Wide Web Data Set

The "World Wide Web" data sets consist of 8,282 WWW pages collected during the WebKB project at Carnegie Mellon University in January 1997 from the web sites of the computer science departments of the four institutions Cornell, Texas, Washington, and Wisconsin. The pages were manually classified into the categories "student", "faculty", "staff", "department", "course", "project", and "other".
protein data
           Reg.    H-α    H-β    M      GH
Size       —       72     72     39     30
ν-SVM      0.05    1.3    4.0    0.5    0.5
ν-SVM      0.1     1.8    4.5    0.5    0.9
ν-SVM      0.2     2.2    8.9    0.5    0.9
P-SVM      300     0.4    3.5    0.0    0.4
P-SVM      400     0.4    3.1    0.0    0.9
P-SVM      500     0.4    3.5    0.0    1.3
Table 2: Percentage of misclassification for the "protein" data set for classifiers obtained with the P- and ν-SVM methods. Column "Reg." lists the values of the regularization parameter (ν for the ν-SVM and C for the P-SVM). Columns "H-α" to "GH" provide the percentage of misclassification for the four problems "one class against the rest" ("Size" denotes the number of proteins per class), computed using 10-fold cross validation. The best results for each problem are shown in bold. The data matrix was not positive semi-definite.
Every pair (i, j) of pages is characterized by whether page i contains a hyperlink to page j and vice versa. The data is summarized using two binary matrices and a ternary matrix. The first matrix K ("out") contains a one if at least one outgoing link (i → j) exists and a zero if no outgoing link exists; the second matrix K^T ("in") contains a one if at least one ingoing link (j → i) exists and a zero otherwise; and the third, ternary matrix (1/2)(K + K^T) ("sym") contains a zero if no link exists, a value of 0.5 if only one unidirectional link exists, and a value of 1 if links exist in both directions. In the following, we restricted the data set to pages from the first six classes which had more than one in- or outgoing link. The data set thus consists of the four subsets "Cornell" (350 pages), "Texas" (286 pages), "Wisconsin" (300 pages), and "Washington" (433 pages). Table 3 summarizes the classification results for the C- and P-SVM methods. The parameter C for both SVMs was optimized for each cross validation trial using another 4-fold cross validation on the training set. Significance tests were performed to evaluate the differences in generalization performance using the 10-fold cross-validated paired t-test (Dietterich, 1998). We considered 48 tasks (4 universities, 4 classes, 3 matrices) and checked for each task whether the C-SVM or the P-SVM performed better using a p-value of 0.05. In 30 tasks the P-SVM had a significantly better performance than the C-SVM, while the C-SVM was never significantly better than the P-SVM (for details see http://ni.cs.tu-berlin.de/publications/psvm_sup). Classification results are better for the asymmetric matrices "in" and "out" than for the symmetric matrix "sym", because there are cases for which highly indicative pages (hubs) exist which are connected to one particular class of pages by either in- or outgoing links. The symmetric case blurs the contribution of the indicative pages, because ingoing and outgoing links can no longer be distinguished, which leads to poorer performance. Because the P-SVM yields fewer support vectors, online classification is faster than for the C-SVM and, if web pages cease to exist, the P-SVM is less likely to be affected.
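The construction of the three connectivity matrices from a raw link list can be sketched as follows; the page indices and links are made up for illustration.

```python
import numpy as np

def link_matrices(links, n):
    """Build the "out", "in", and "sym" matrices from (i, j) link pairs,
    where (i, j) means page i contains a hyperlink to page j."""
    K_out = np.zeros((n, n))
    for i, j in links:
        K_out[i, j] = 1.0            # binary "out" matrix
    K_in = K_out.T                   # binary "in" matrix
    K_sym = 0.5 * (K_out + K_out.T)  # ternary: 0, 0.5 (one way), 1 (both)
    return K_out, K_in, K_sym

# toy example: pages 0 and 1 link to each other; page 2 links to page 0
K_out, K_in, K_sym = link_matrices([(0, 1), (1, 0), (2, 0)], n=3)
```

Any of the three matrices can be fed to the P-SVM directly as K; note that "out" and "in" are generally asymmetric and "sym" is symmetric but not necessarily positive semidefinite.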
3.2 Application to Regression Problems

In this section we report results for the data sets "robot arm" (2 features), "boston housing" (13 features), "computer activity" (21 features), and "abalone" (10 features) from the UCI benchmark repository. The data preprocessing is described in (Chu et al., 2004), and the data sets are available as training/test set pairs at http://guppy.mpe.nus.edu.sg/~chuwei/data. The sizes of the data sets were (training set / test set): "robot arm": 200 / 200, 1 set; "boston housing": 481 / 25, 100 sets; "computer activity": 1000 / 6192, 10 sets; "abalone": 3000 / 1177, 10 sets. Pairwise data sets were generated by constructing the Gram matrices for radial basis function kernels of different widths σ, and the Gram matrices were used as input for the three regression methods C-support vector regression (SVR, Schölkopf and Smola, 2002), Bayesian support vector regression (BSVR, Chu et al., 2004), and the P-SVM. Hyperparameters (C and σ) were optimized using n-fold cross validation (n = 50 for "robot arm", n = 20 for "boston housing", n = 4 for "computer activity", and n = 4 for "abalone"). Parameters were first optimized on a coarse 4 × 4 grid and later refined on a 7 × 7 fine grid around the values for C and σ selected in the first step (65 tests per parameter selection). Table 4 shows the mean squared error and the standard deviation of the results. We also performed a Wilcoxon signed rank test to verify the significance of these results (for details see http://ni.cs.tu-berlin.de/publications/psvm_sup), except for the "robot arm" data set, which has only one training/test set pair, and the "boston housing" data set, which contains too few test examples. On "computer activity" SVR was significantly better (5 % threshold) than BSVR, and on both data sets "computer activity" and "abalone", SVR and BSVR were significantly outperformed by the P-SVM, which in addition used fewer support vectors than its competitors.
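The two-stage grid search described above (a 4 × 4 coarse grid followed by a 7 × 7 fine grid, 65 evaluations in total) can be sketched as follows. The grid values and the refinement span of one coarse step are assumptions, since the exact grids are not listed, and `cv_error` is a stand-in for a real cross-validation estimate.

```python
import numpy as np
from itertools import product

def coarse_to_fine(score, C_coarse, s_coarse):
    """Minimize score(C, sigma) on a coarse grid, then on a 7x7
    logarithmic fine grid centered on the best coarse point."""
    C0, s0 = min(product(C_coarse, s_coarse), key=lambda p: score(*p))
    C_fine = C0 * np.logspace(-0.5, 0.5, 7)  # within one coarse step
    s_fine = s0 * np.logspace(-0.5, 0.5, 7)
    return min(product(C_fine, s_fine), key=lambda p: score(*p))

# stand-in for a cross-validation error, minimal near C ~ 2, sigma ~ 0.63
def cv_error(C, sigma):
    return (np.log10(C) - 0.3) ** 2 + (np.log10(sigma) + 0.2) ** 2

grid = np.logspace(-2, 1, 4)  # 0.01, 0.1, 1, 10 (16 coarse evaluations)
C_best, s_best = coarse_to_fine(cv_error, grid, grid)
```

The 16 coarse plus 49 fine evaluations match the 65 tests per parameter selection mentioned above; the refinement buys roughly one extra decimal digit of resolution at fixed cost.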
3.3 Application to Feature Selection Problems

In this section we apply the P-SVM to feature selection problems of various kinds, using the "correlation threshold" regularization scheme (Section 2.3.5). We first reanalyze the "protein" and "world wide web" data sets of Sections 3.1.2 and 3.1.3 and then report results on three DNA microarray data sets. Further feature selection results can be found in (Hochreiter and Obermayer, 2005), which also reports results for the NIPS 2003 feature selection challenge, where the P-SVM was the best stand-alone method for selecting a compact feature set, and in (Hochreiter and Obermayer, 2004b), where details of the microarray data set benchmarks are reported.
                    Course         Faculty        Project        Student
Cornell University
  Size              57             60             52             143
  C-SVM (Sym)       11.1 (6.2)     19.7 (5.3)     13.7 (4.0)     50.0 (11.5)
  C-SVM (Out)       12.6 (3.1)     15.1 (6.0)     10.6 (4.9)     22.3 (10.3)
  C-SVM (In)        11.1 (4.9)     21.4 (4.3)     14.6 (5.5)     48.9 (15.5)
  P-SVM (Sym)       12.3 (3.3)     17.1 (6.2)     15.4 (6.3)     19.1 (5.7)
  P-SVM (Out)       8.6 (3.8)      14.3 (6.3)     8.3 (4.9)      16.9 (7.9)
  P-SVM (In)        7.1 (4.1)      13.7 (6.6)     10.9 (5.5)     17.1 (5.7)
Texas University
  Size              52             35             29             129
  C-SVM (Sym)       17.2 (9.0)     22.0 (9.1)     19.8 (6.7)     53.5 (11.8)
  C-SVM (Out)       9.5 (5.1)      16.5 (5.8)     20.2 (8.7)     28.9 (11.7)
  C-SVM (In)        12.6 (4.9)     20.6 (7.8)     20.9 (5.1)     16.4 (7.6)
  P-SVM (Sym)       15.8 (5.8)     13.6 (7.3)     12.2 (6.2)     25.5 (6.9)
  P-SVM (Out)       8.1 (7.6)      9.8 (3.6)      9.8 (3.9)      20.9 (6.7)
  P-SVM (In)        12.3 (5.6)     10.5 (6.3)     9.4 (4.6)      13.0 (5.0)
Wisconsin University
  Size              77             36             22             117
  C-SVM (Sym)       27.0 (10.0)    22.0 (5.5)     14.0 (6.4)     49.3 (11.1)
  C-SVM (Out)       19.3 (7.5)     16.0 (3.8)     10.3 (4.8)     34.3 (10.5)
  C-SVM (In)        22.0 (8.6)     16.3 (5.8)     7.7 (4.5)      24.3 (9.9)
  P-SVM (Sym)       18.7 (4.5)     15.0 (9.3)     10.0 (5.4)     34.3 (8.6)
  P-SVM (Out)       12.0 (5.5)     11.3 (6.5)     7.7 (4.2)      23.7 (4.8)
  P-SVM (In)        13.3 (4.4)     8.7 (8.2)      6.3 (7.1)      13.3 (5.9)
Washington University
  Size              169            44             39             151
  C-SVM (Sym)       19.6 (6.8)     18.7 (6.8)     10.6 (3.5)     43.6 (8.3)
  C-SVM (Out)       10.6 (4.6)     14.1 (3.0)     14.3 (4.8)     28.2 (9.8)
  C-SVM (In)        20.3 (6.4)     20.4 (5.3)     13.8 (4.7)     38.3 (11.9)
  P-SVM (Sym)       17.1 (4.4)     13.4 (6.6)     8.8 (2.1)      20.3 (6.8)
  P-SVM (Out)       10.6 (5.2)     12.7 (2.9)     6.7 (3.4)      17.1 (4.3)
  P-SVM (In)        11.8 (5.6)     9.2 (6.2)      6.7 (2.0)      14.3 (6.9)
Table 3: Percentage of misclassification for the World Wide Web data sets for classifiers obtained with the P- and C-SVM methods. The percentage of misclassification was measured using 10-fold cross validation. The best and second best results for each data set and classification task are indicated in bold and italics; numbers in brackets denote standard deviations of the trials.
3.3.1 Protein and World Wide Web Data Sets

In this section we again apply the P-SVM to the "protein" and "world wide web" data sets of Sections 3.1.2 and 3.1.3. Using both regularization schemes
          robot arm (×10⁻³)   boston housing   computer activity   abalone
SVR       5.84                10.27 (7.21)     13.80 (0.93)        0.441 (0.021)
BSVR      5.89                12.34 (9.20)     17.59 (0.98)        0.438 (0.023)
P-SVM     5.88                 9.42 (4.96)     10.28 (0.47)        0.424 (0.017)
Table 4: Regression results for the UCI data sets. The table shows the mean squared error and its standard deviation in brackets. Best results for each data set are shown in bold. For the “robot arm” data only one data set was available and, therefore, no standard deviation is given.
simultaneously leads to a trade-off between a small number of features (a small number of measurements) and a better classification result. Reducing the number of features is beneficial if measurements are costly and if a small increase in prediction error can be tolerated. Table 5 shows the results for the "protein" data set for various values of the regularization parameter ε. C was set to 100, because it gave good results for a wide range of ε values. We chose a minimal ε = 0.2 because it resulted in a classifier where all complex features were support vectors. The size of the training set is 203. Note that C was smaller than in the experiments in Section 3.1.2, because large values of ε push the dual variables towards zero, so that large values of C then have no influence. The table shows that classification performance drops if fewer features are considered, but that 5 % of the features suffice to obtain a performance of only about 5 % misclassifications, compared to about 2 % at the optimum. Since every feature value has to be determined via a sequence alignment, this saving in computation time is essential for large data bases like the Swiss-Prot data base (130,000 proteins), where supplying all pairwise relations is currently impossible.
protein data
  ε       H-α          H-β          M           GH
  0.2     1.3 (203)    4.9 (203)    0.9 (203)   1.3 (203)
  1       2.6 (41)     5.3 (110)    1.3 (28)    4.4 (41)
  10      3.5 (10)     8.8 (26)     1.8 (5)     13.3 (7)
  20      3.5 (5)      8.4 (12)     4.0 (4)     13.3 (5)
Table 5: Percentage of misclassification and the number of support features (in brackets) for the "protein" data set for the P-SVM method. The total number of features is 226. C was 100. The five columns (left → right) show the values for ε and the results for the four classification problems "one class against the rest" using 10-fold cross validation.
Table 6 shows the corresponding results (10–fold cross validation) for the
P-SVM applied to the "world wide web" data set "Cornell" and for the classification problem "student pages vs. the rest". Only ingoing links (matrix K^T of Section 3.1.3) were used. C was optimized using 3-fold cross validation on the corresponding training sets for each of the 10-fold cross validation runs. By increasing the regularization parameter ε, the number of web pages which have to be considered in order to classify a new page (the number of support vectors) decreases from 135 to 8. At the same time, the percentage of pages which can no longer be classified, because they receive no ingoing link from one of the "support vector pages", increases. The percentage of misclassification, however, is reduced from 14 % for ε = 0.1 to 0.6 % for ε = 2.0. With only 8 pages providing ingoing links, more than 50 % of the pages could be classified with only a 0.6 % misclassification rate.
"Cornell" data set, student pages
  ε     % cl.   % mis.   # (%) SVs         ε     % cl.   % mis.   # (%) SVs
  0.1   84      14       135 (38.6)        0.8   65      3.1      34 (9.7)
  0.2   81      12       115 (32.8)        0.9   64      2.7      32 (9.1)
  0.3   79      9.7      99 (28.3)         1.0   61      1.4      27 (7.7)
  0.4   75      6.9      72 (20.6)         1.1   59      1.0      21 (6.0)
  0.5   73      5.5      58 (16.6)         1.4   56      1.0      12 (3.4)
  0.6   71      4.8      48 (13.7)         1.6   55      1.0      10 (2.8)
  0.7   66      3.9      38 (10.9)         2.0   51      0.6      8 (2.3)
Table 6: P-SVM feature selection and classification results (10-fold cross validation) for the "world wide web" data set "Cornell" and the classification problem "student pages against the rest". The columns show (left → right): the value of ε, the percentage of classified pages ("cl."), the percentage of misclassifications ("mis."), and the number (and percentage) of support vectors. C was obtained by a 3-fold cross validation on the corresponding training sets.
3.3.2 DNA Microarray Data Sets

In this subsection we describe the application of the P-SVM to DNA microarray data. The data was taken from Pomeroy et al. (2002) (brain tumor), Shipp et al. (2002) (lymphoma tumor), and van't Veer et al. (2002) (breast cancer). All three data sets consist of tissue samples from different patients which were characterized by the expression values of a large number of genes. All samples were labeled according to the outcome of a particular treatment (positive/negative), and the task was to predict the outcome of the treatment for a new patient. We compared the classification performance of the following combinations of a feature selection and a classification method:
     feature selection                       classification
(1)  expression value of the TrkC gene      one gene classification
(2)  SPLASH (Califano et al., 1999)         likelihood ratio classifier (LRC)
(3)  signal-to-noise statistics (STN)       K-nearest neighbor (KNN)
(4)  signal-to-noise statistics (STN)       weighted voting (voting)
(5)  Fisher statistics (Fisher)             weighted voting (voting)
(6)  R2W2                                   R2W2
(7)  P-SVM                                  ν-SVM
The P-SVM/ν-SVM results are taken from (Hochreiter and Obermayer, 2004b), where the details concerning the data sets and the gene selection procedure based on the P-SVM can be found; the results for the other combinations were taken from the references cited above. All results are summarized in Table 7. The comparison shows that the P-SVM method outperforms the standard methods.
Brain Tumor
Feature Selection / Classification     # F    # E
TrkC gene                              1      33
SPLASH / LRC                           —      25
R2W2                                   —      25
STN / voting                           —      23
STN / KNN                              8      22
TrkC & SVM & KNN                       —      20
P-SVM / ν-SVM                          45     7

Lymphoma
Feature Selection / Classification     # F    # E
STN / KNN                              8      28
STN / voting                           13     24
R2W2                                   —      22
P-SVM / ν-SVM                          18     21

Breast Cancer
Feature Selection / Classification     # F    # E    ROC area
Fisher / voting                        70     26     0.88
P-SVM / ν-SVM                          30     15     0.77
Table 7: Results for the three DNA microarray data sets "brain tumor", "lymphoma", and "breast cancer" (see text). The table shows the leave-one-out classification error E (% misclassifications) and the number F of selected features. For breast cancer only the minimal value of E for different thresholds was available; therefore the area under a receiver operating curve is provided in addition. Best results are shown in bold.
4 Summary
In this contribution we have described the Potential Support Vector Machine (P-SVM) as a new method for classification, regression, and feature selection. The P-SVM selects models using the principle of structural risk minimization. In contrast to standard SVM approaches, however, the P-SVM is based on a new objective function and a new set of constraints which lead to an expansion of the classification or regression function in terms of "support features". The optimization problem is quadratic, always well defined, suited for dyadic data, and requires neither square nor positive definite Gram matrices. Therefore, the method can also be used without preprocessing with matrices which are measured and with matrices which are constructed from a vectorial representation using an indefinite kernel function. In feature selection mode the P-SVM allows features to be selected and ranked through the support vector weights of its sparse set of support vectors. The sparseness constraint avoids sets of features which are redundant. In a classification or regression setting this is an advantage over statistical methods, where redundant features are often kept as long as they provide information about the objects' attributes. Because the dual formulation of the optimization problem can be solved by a fast SMO technique, the new P-SVM can be applied to data sets with many features. The P-SVM approach was compared with several state-of-the-art classification, regression, and feature selection methods. Whenever significance tests could be applied, the P-SVM never performed significantly worse than its "competitors"; in many cases it performed significantly better. But even if no significant improvement in prediction error could be found, the P-SVM needed fewer "support features", i.e. fewer measurements, for evaluating new data objects. Finally, we have suggested a new interpretation of dyadic data, where objects in the real world are not described by vectors.
Structures like dot products are induced directly through measurements of object pairs, i.e. through relations between objects. This opens up a new field of research where relations between real world objects determine mathematical structures.
Acknowledgments

We thank M. Albery-Speyer, C. Büscher, C. Minoux, R. Sanyal, and S. Seo for their help with the numerical simulations. This work was funded by the Anna-Geissler- and the Monika-Kuntzner-Stiftung.
A Measurements, Kernels, and Dot Products
In this appendix we address the question under what conditions a “measurement kernel” which gives rise to a measured matrix $\boldsymbol{K}$ can be interpreted as a dot product between feature vectors describing the “row” and “column” objects of a “dyadic data” set. Assume that “column” objects $x$ (“samples”) and “row” objects $z$ (“complex features”) are from sets $\mathcal{X}$ and $\mathcal{Z}$, which can both be
completed by a $\sigma$-algebra and a measure to measurable spaces. Let $(\mathcal{U}, \mathcal{B}, \mu)$ be a measurable space with $\sigma$-algebra $\mathcal{B}$ and a $\sigma$-additive measure $\mu$ on the set $\mathcal{U}$. We consider functions $f : \mathcal{U} \to \mathbb{R}$ on the set $\mathcal{U}$. A function $f$ is called $\mu$-measurable on $(\mathcal{U}, \mathcal{B})$ if $f^{-1}([a, b]) \in \mathcal{B}$ for all $a, b \in \mathbb{R}$, and $\mu$-integrable if $\int_{\mathcal{U}} |f| \, d\mu < \infty$. We define
$$ \|f\|_{L^2} \;:=\; \left( \int_{\mathcal{U}} f^2 \, d\mu \right)^{\frac{1}{2}} \qquad (36) $$
and the set
$$ L^2(\mathcal{U}) \;:=\; \left\{ f : \mathcal{U} \to \mathbb{R} \,;\ f \text{ is } \mu\text{-measurable and } \|f\|_{L^2} < \infty \right\} . \qquad (37) $$
$L^2(\mathcal{U})$ is a Banach space with norm $\|\cdot\|_{L^2}$. If we define the dot product
$$ \langle f, g \rangle_{L^2(\mathcal{U})} \;:=\; \int_{\mathcal{U}} f \, g \, d\mu \qquad (38) $$
then the Banach space $L^2(\mathcal{U})$ is a Hilbert space with dot product $\langle \cdot, \cdot \rangle_{L^2(\mathcal{U})}$ and scalar field $\mathbb{R}$. For simplicity, we denote this Hilbert space by $L^2(\mathcal{U})$. $L^2(\mathcal{U}_1, \mathcal{U}_2)$ is the Hilbert space of functions $k$ with $\int \int k^2(u_1, u_2) \, d\mu(u_2) \, d\mu(u_1) < \infty$, using the product measure $\mu(\mathcal{U}_1 \times \mathcal{U}_2) = \mu(\mathcal{U}_1) \, \mu(\mathcal{U}_2)$. With these definitions we see that $H_1 := L^2(\mathcal{Z})$, $H_2 := L^2(\mathcal{X})$, and $H_3 := L^2(\mathcal{X}, \mathcal{Z})$ are Hilbert spaces of $L^2$-functions with domains $\mathcal{Z}$, $\mathcal{X}$, and $\mathcal{X} \times \mathcal{Z}$, respectively. The dot product in $H_i$ is denoted by $\langle \cdot, \cdot \rangle_{H_i}$. Note that for discrete $\mathcal{X}$ or $\mathcal{Z}$ the respective integrals can be replaced by sums (using a measure of Dirac delta functions at the discrete points, see Werner, 2000, p. 464, ex. (c)). Let us now assume that $k \in H_3$. $k$ induces a Hilbert-Schmidt operator $T_k$:
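For discrete sets the $L^2$ dot product of eq. (38) reduces to an ordinary finite sum, so functions on $\mathcal{U}$ become plain vectors. The following sketch (hypothetical numbers, numpy used purely for illustration) makes this concrete:

```python
import numpy as np

# Hypothetical discrete set U = {u_1, ..., u_5} with a counting (Dirac)
# measure: functions f, g on U are just the vectors of their values, and
# the dot product of eq. (38) becomes a finite sum.
f = np.array([1.0, -0.5, 2.0, 0.0, 1.5])
g = np.array([0.5, 1.0, -1.0, 3.0, 2.0])

dot_fg = np.sum(f * g)            # <f, g>_{L2(U)} as a sum over the points of U
norm_f = np.sqrt(np.sum(f ** 2))  # ||f||_{L2}, eq. (36), for the same measure

print(dot_fg, norm_f)
```

This is the sense in which, for finite row and column sets, the Hilbert space machinery below collapses to ordinary matrix algebra.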
$$ f(x) \;=\; (T_k \alpha)(x) \;=\; \int_{\mathcal{Z}} k(x, z) \, \alpha(z) \, d\mu(z) \,, \qquad (39) $$
which maps $\alpha \in H_1$ (a parameterization) to $f \in H_2$ (a classifier). If we set $\alpha(z) = \sum_{j=1}^{P} \alpha_j \, \delta_{z^j}$, we recover the P-SVM classification function (without $b$), eq. (27),
$$ f(u) \;=\; \sum_{j=1}^{P} \alpha_j \, k\!\left(u, z^j\right) \;=\; \sum_{j=1}^{P} \alpha_j \, \boldsymbol{K}(u)_j \,, \qquad (40) $$
where $\alpha_j = \alpha(z^j)$ and $\delta_{z^j}$ is the Dirac delta function at location $z^j$. The next theorem provides the conditions a kernel must fulfill in order to be interpretable as a dot product between the objects' feature vectors.

Theorem 1 (Singular Value Expansion) Let $\alpha$ be from $H_1$ and let $k$ be a kernel from $H_3$ which defines a Hilbert-Schmidt operator $T_k : H_1 \to H_2$
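The expansion of eq. (40) can be sketched in a few lines. The support features $z^j$, their weights $\alpha_j$, and the RBF measurement kernel below are all hypothetical stand-ins; the point is only that $f$ is evaluated as a sum over "row" (support feature) objects, not over training samples:

```python
import numpy as np

# Sketch of the P-SVM classification function of eq. (40), bias b omitted:
#   f(u) = sum_j alpha_j k(u, z^j).
rng = np.random.default_rng(0)
Z = rng.normal(size=(3, 2))          # three hypothetical support features z^1..z^3
alpha = np.array([0.7, -1.2, 0.5])   # their (hypothetical) P-SVM weights

def k(u, z, width=1.0):
    """An example measurement kernel k(u, z); here an RBF, chosen for illustration."""
    return np.exp(-np.sum((u - z) ** 2) / (2.0 * width ** 2))

def f(u):
    """Classification function f(u) = sum_j alpha_j k(u, z^j)."""
    return sum(a * k(u, zj) for a, zj in zip(alpha, Z))

u = np.array([0.1, -0.3])
print(np.sign(f(u)))                 # predicted class label for the new object u
```

Note that nothing here requires the kernel to be positive definite or the matrix $\boldsymbol{K}$ to be square; only evaluations $k(u, z^j)$ against the support features are needed.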
$$ (T_k \alpha)(x) \;=\; f(x) \;=\; \int_{\mathcal{Z}} k(x, z) \, \alpha(z) \, dz \,. \qquad (41) $$
Then
$$ \|f\|_{H_2}^2 \;=\; \langle T_k^{*} T_k \alpha, \alpha \rangle_{H_1} \,, \qquad (42) $$
where $T_k^{*}$ is the adjoint operator of $T_k$, and there exists an expansion
$$ k(x, z) \;=\; \sum_n s_n \, e_n(z) \, g_n(x) \qquad (43) $$
which converges in the $L^2$-sense. The $s_n \geq 0$ are the singular values of $T_k$, and $e_n \in H_1$, $g_n \in H_2$ are the corresponding orthonormal functions. For $\mathcal{X} = \mathcal{Z}$ and a symmetric, positive definite kernel $k$, we obtain the eigenfunctions $e_n = g_n$ of $T_k$ with corresponding eigenvalues $s_n$.

Proof. From $f = T_k \alpha$ we obtain
$$ \|f\|_{H_2}^2 \;=\; \langle T_k \alpha, T_k \alpha \rangle_{H_2} \;=\; \langle T_k^{*} T_k \alpha, \alpha \rangle_{H_1} \,. \qquad (44) $$
The singular value expansion of $T_k$ is
$$ T_k \alpha \;=\; \sum_n s_n \, \langle \alpha, e_n \rangle_{H_1} \, g_n \qquad (45) $$
(see Werner, 2000, Theorem VI.3.6). The values $s_n$ are the singular values of $T_k$ for the orthonormal systems $\{e_n\}$ on $H_1$ and $\{g_n\}$ on $H_2$. We define
$$ r_{nm} \;:=\; \langle T_k e_n, g_m \rangle_{H_2} \;=\; \delta_{nm} \, s_m \,, \qquad (46) $$
where the last “=” results from eq. (45) for $\alpha = e_n$. The sum
$$ \sum_m r_{nm}^2 \;=\; \sum_m \left( \langle T_k e_n, g_m \rangle_{H_2} \right)^2 \;\leq\; \|T_k e_n\|_{H_2}^2 \;<\; \infty \qquad (47) $$
converges because of Bessel's inequality (the “$\leq$”-sign). Next we complete the orthonormal system (ONS) $\{e_n\}$ to an orthonormal basis (ONB) $\{\tilde{e}_l\}$ by adding an ONB of the kernel $\ker(T_k)$ of the operator $T_k$ to the ONS $\{e_n\}$. The function $\alpha \in H_1$ possesses a unique representation through this basis: $\alpha = \sum_l \langle \alpha, \tilde{e}_l \rangle_{H_1} \, \tilde{e}_l$. We obtain
$$ T_k \alpha \;=\; \sum_l \langle \alpha, \tilde{e}_l \rangle_{H_1} \, T_k \tilde{e}_l \,, \qquad (48) $$
where we used that $T_k$ is continuous. Because $T_k \tilde{e}_l = 0$ for all $\tilde{e}_l \in \ker(T_k)$, the image $T_k \alpha$ can be expressed through the ONS $\{e_n\}$:
$$ T_k \alpha \;=\; \sum_n \langle \alpha, e_n \rangle_{H_1} \, T_k e_n \qquad (49) $$
$$ \;=\; \sum_n \langle \alpha, e_n \rangle_{H_1} \left( \sum_m \langle T_k e_n, g_m \rangle_{H_2} \, g_m \right) \;=\; \sum_{n,m} r_{nm} \, \langle \alpha, e_n \rangle_{H_1} \, g_m \,. $$
Here we used the fact that $\{g_m\}$ is an ONB of the range of $T_k$ and, therefore, $T_k e_n = \sum_m \langle T_k e_n, g_m \rangle_{H_2} \, g_m$. Because the set of functions $\{e_n g_m\}$ is an ONS in $H_3$ (which can be completed to an ONB) and $\sum_{n,m} r_{nm}^2 < \infty$ (cf. eq. (47)), the kernel
$$ \tilde{k}(z, x) \;:=\; \sum_{n,m} r_{nm} \, e_n(z) \, g_m(x) \qquad (50) $$
is from $H_3$. We observe that the induced Hilbert-Schmidt operator $T_{\tilde{k}}$ is equal to $T_k$:
$$ \left( T_{\tilde{k}} \alpha \right)(x) \;=\; \sum_{n,m} r_{nm} \, \langle \alpha, e_n \rangle_{H_1} \, g_m(x) \;=\; (T_k \alpha)(x) \,, \qquad (51) $$
where the first “=”-sign follows from eq. (50) and the second “=”-sign from eq. (49). Hence, the kernel $k$ and the kernel $\tilde{k}$ are equal except on a set of measure zero, i.e. $k = \tilde{k}$. We obtain $\langle T_k e_l, g_t \rangle_{H_2} = \delta_{lt} \, s_l$ from eq. (46) and $\langle T_k e_l, g_t \rangle_{H_2} = r_{lt}$ from eq. (51), and, therefore, $r_{lt} = \delta_{lt} \, s_l$. Inserting $r_{nm} = \delta_{nm} s_n$ into eq. (50) proves eq. (43) of the theorem. The last statement of the theorem follows from the fact that $|T_k| = (T_k^{*} T_k)^{1/2} = T_k$ ($T_k$ is positive and selfadjoint) and, therefore, $e_n = g_n$ (Werner, 2000, proof of Theorem VI.3.6, p. 246 top). ∎

As a consequence of this theorem, we can for finite $\mathcal{Z}$ define a mapping $\omega$ of “row” objects $z$ and a mapping of “column” objects $x$ into a common feature space where $k$ is a dot product.
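For finite row and column sets the theorem reduces to the singular value decomposition of the measurement matrix: eq. (43) becomes $\boldsymbol{K} = \sum_n s_n \, g_n e_n^{\top}$, and eq. (42) becomes $\|\boldsymbol{K}\alpha\|^2 = \alpha^{\top} \boldsymbol{K}^{\top} \boldsymbol{K} \alpha$. The following numerical sketch (a hypothetical random matrix, numpy used for illustration) checks both identities, including the symmetric positive semi-definite case where singular values and eigenvalues coincide:

```python
import numpy as np

# Discrete analogue of Theorem 1: for a finite measurement matrix K
# (rows = "column" objects x, columns = "row" objects z), the SVD
# K = U diag(s) V^T is the expansion of eq. (43), with the columns of
# U and V playing the roles of g_n and e_n.
rng = np.random.default_rng(1)
K = rng.normal(size=(4, 6))               # neither square nor positive definite

U, s, Vt = np.linalg.svd(K, full_matrices=False)
K_rebuilt = sum(s[n] * np.outer(U[:, n], Vt[n, :]) for n in range(len(s)))
assert np.allclose(K, K_rebuilt)          # eq. (43) holds exactly here

# Eq. (42): ||K alpha||^2 = <K^T K alpha, alpha>.
alpha = rng.normal(size=6)
assert np.isclose((K @ alpha) @ (K @ alpha), alpha @ (K.T @ K) @ alpha)

# Symmetric positive semi-definite case (X = Z): the singular values of
# K K^T equal its eigenvalues, mirroring e_n = g_n in the theorem.
A = K @ K.T
w = np.linalg.eigvalsh(A)                 # ascending eigenvalues
s2 = np.linalg.svd(A, compute_uv=False)   # descending singular values
assert np.allclose(np.sort(w)[::-1], s2)
```

In this finite setting the mappings into the common feature space are simply $\omega(z^j) = (\sqrt{s_n}\, e_n(z^j))_n$ and the analogous scaled $g_n$-coordinates for $x$, so that their dot product recovers $k(x, z^j)$.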