Support Vector Machines for Dyadic Data

Sepp Hochreiter and Klaus Obermayer, Department of Electrical Engineering and Computer Science, Technische Universität Berlin, 10587 Berlin, {hochreit,oby}@cs.tu-berlin.de

Abstract. We describe a new technique for the analysis of dyadic data, where two sets of objects (“row” and “column” objects) are characterized by a matrix of numerical values which describe their mutual relationships. The new technique, called “Potential Support Vector Machine” (P-SVM), is a large-margin method for the construction of classifiers and regression functions for the “column” objects. Contrary to standard support vector machine approaches, the P-SVM minimizes a scale-invariant capacity measure and requires a new set of constraints. As a result, the P-SVM method leads to a usually sparse expansion of the classification and regression functions in terms of the “row” rather than the “column” objects and can handle data and kernel matrices which are neither positive definite nor square. We then describe two complementary regularization schemes. The first scheme improves generalization performance for classification and regression tasks; the second scheme leads to the selection of a small, informative set of “row” “support” objects and can be applied to feature selection. Benchmarks for classification, regression, and feature selection tasks are performed with toy data as well as with several real world data sets. The results show that the new method is at least competitive with but often performs better than the benchmarked standard methods for standard vectorial as well as for true dyadic data sets. In addition, a theoretical justification is provided for the new approach.

1 Introduction

Learning from examples in order to predict is one of the standard tasks in machine learning. Many techniques have been developed to solve classification and regression problems, but by far most of them were specifically designed for vectorial data. Vectorial data, where data objects are described by vectors of numbers which are treated as elements of a vector space, are very convenient because of the structure imposed by the Euclidean metric. However, there are

datasets for which a vector-based description is inconvenient or simply wrong, and representations which consider relationships between objects are more appropriate.

A)        a     b     c     d
   α    1.3  -2.2  -1.6   7.8
   β   -1.8  -1.1   7.2   2.3
   χ    1.2   1.9  -2.9  -2.2
   δ    3.7   0.8  -0.6   2.5
   ε    9.2  -9.4  -8.3   9.2
   φ   -7.7   8.6  -9.7  -7.4
   γ   -4.8   0.1  -1.2   0.9

B)        a     b     c     d     e     f     g
   a    0.9  -0.1  -0.8   0.5   0.2  -0.5  -0.7
   b   -0.1   0.9   0.6   0.3  -0.7  -0.6   0.3
   c   -0.8   0.6   0.9   0.2  -0.6   0.6   0.5
   d    0.5   0.3   0.2   0.9   0.7   0.1   0.3
   e    0.2  -0.7  -0.6   0.7   0.9  -0.9  -0.5
   f   -0.5  -0.6   0.6   0.1  -0.9   0.9   0.9
   g   -0.7   0.3   0.5   0.3  -0.5   0.9   0.9

Figure 1: A) Dyadic data: Column objects {a, b, ...} and row objects {α, β, ...} are represented by a matrix of numerical values which describe their mutual relationships. B) Pairwise data: A special case of dyadic data, where row and column objects coincide.

In the following we will study representations of data objects which are based on matrices. We consider “column” objects, which are the objects to be described, and “row” objects, which are the objects which serve for their description (cf. Fig. 1A). The whole dataset can then be represented using a rectangular matrix whose entries denote the relationships between the corresponding “row” and “column” objects. We will call representations of this form dyadic data (Hofmann and Puzicha, 1998; Li and Loken, 2002; Hoff, 2005). If “row” and “column” objects are from the same set (Fig. 1B), the representation is usually called pairwise data, and the entries of the matrix can often be interpreted as the degree of similarity (or dissimilarity) between objects. Dyadic descriptions are more powerful than vector-based descriptions, as vectorial data can always be brought into dyadic form when required. This is often done for kernel-based classifiers or regression functions (Schölkopf and Smola, 2002; Vapnik, 1998), where a Gram matrix of mutual similarities is calculated before the predictor is learned. A similar procedure can also be used in cases where the “row” and “column” objects are from different sets. If both of them are described by feature vectors, a matrix can be calculated by applying a kernel function to pairs of feature vectors, one vector describing a “row” and the other vector describing a “column” object. One example for this is the drug-gene matrix of Scherf et al. (2000), which was constructed as the product of a measured drug-sample and a measured sample-gene matrix and where the kernel function was a scalar product.
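As a minimal illustration of this construction, the following sketch (our own, not part of the original text) brings a small set of feature vectors into pairwise form by evaluating an RBF kernel on all pairs; the data and the kernel width are arbitrary choices.

```python
# Build a pairwise (Gram) matrix from vectorial data by evaluating a kernel
# on all pairs of feature vectors.  Data and kernel width are illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                          # 5 objects, 3 features each

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2.0 * 0.5 ** 2))             # RBF Gram matrix, 5 x 5

print(K.shape, np.allclose(K, K.T))                  # square and symmetric
```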

In many cases, however, dyadic descriptions emerge because the matrix entries are measured directly. Pairwise data representations as a special case of dyadic data can be found for datasets where similarities or distances between objects are measured. Examples include sequential or biophysical similarities between proteins (Lipman and Pearson, 1985; Sigrist et al., 2002; Falquet et al., 2002), chromosome location or co-expression similarities between genes (Cremer et al., 1993; Lu et al., 1994; Heyer et al., 1999), co-citations of text documents (White and McCain, 1989; Bayer et al., 1990; Ahlgren et al., 2003), or hyperlinks between web-pages (Kleinberg, 1999). In general, these measured matrices are symmetric but may not be positive definite, and even if they are for the training set, they may not remain positive definite if new examples are included. Genuine dyadic data occur whenever two sets of objects are related. Examples are DNA microarray data (Southern, 1988; Lysov et al., 1988; Drmanac et al., 1989; Bains and Smith, 1988), where the “column” objects are tissue samples, the “row” objects are genes, and every sample-gene pair is related by the expression level of this particular gene in this particular sample. Other examples are web-documents, where both the “column” and “row” objects are web-pages and “column” web-pages are described by either hyperlinks to or from the “row” objects, which give rise to a rectangular matrix¹. Further examples include documents in a database described by word-frequencies (Salton, 1968) or molecules described by transferable atom equivalent (TAE) descriptors (Mazza et al., 2001). Traditionally, “row” objects have often been called “features” and “column” vectors of the dyadic data matrix have mostly been treated as “feature vectors” which live in a Euclidean vector space, even when the dyadic nature of the data was made explicit, see (Graepel et al., 1999; Mangasarian, 1998) or the “feature map” method (Schölkopf and Smola, 2002). Difficulties, however, arise when features are heterogeneous and “apples and oranges” must be compared. What theoretical arguments would, for example, justify treating the values of a set of TAE descriptors as coordinates of a vector in a Euclidean space? A “non-vectorial” approach to pairwise data is to interpret the data matrix as a Gram matrix and to apply support vector machines (SVM) for classification and regression if the data matrix is positive semidefinite (Graepel et al., 1999). For indefinite (but symmetric) matrices two other non-vectorial approaches have been suggested (Graepel et al., 1999). In the first approach, the data matrix is projected into the subspace spanned by the eigenvectors with positive eigenvalues. This is an approximation, and one would expect good results only if the absolute values of the negative eigenvalues are small compared to the dominant positive ones. In the second approach, directions of negative eigenvalues are processed by flipping the sign of these eigenvalues. All three approaches lead to positive semidefinite matrices on the training set, but positive semidefiniteness is not assured if a new test object must be included. An embedding approach was suggested by Herbrich et al. (1998) for antisymmetric matrices, but this was

¹ Note that for the pairwise data examples the linking matrix was symmetric because links were considered bidirectional.

specifically designed for data sets where the matrix entries denote preference relations between objects. In summary, no general and principled method exists to learn classifiers or regression functions for dyadic data.

In order to avoid the abovementioned shortcomings, we suggest to consider “column” and “row” objects on an equal footing and interpret the matrix entries as the result of a kernel function or measurement kernel, which takes a “row” object, applies it to a “column” object, and outputs a number. It will turn out that mild conditions on this kernel suffice to create a vector space endowed with a dot product into which the “row” and the “column” objects can be mapped. Using this mathematical argument as a justification, we show how to construct classification and regression functions in this vector space in analogy to the large margin based methods for learning perceptrons for vectorial data. Using an improved measure for model complexity and a new set of constraints which ensure a good performance on the training data, we arrive at a generally applicable method to learn predictors for dyadic data. The new method is called the “potential support vector machine” (P-SVM). It can handle rectangular matrices as well as pairwise data whose matrices are not necessarily positive semidefinite, but even when the P-SVM is applied to regular Gram matrices, it shows very good results when compared with standard methods. Due to the choice of constraints, the final predictor is expanded into a usually sparse set of descriptive “row” objects, which is different from the standard SVM expansion in terms of the “column” objects. This opens up another important application domain: a sparse expansion is equivalent to feature selection (Guyon and Elisseeff, 2003; Hochreiter and Obermayer, 2004b; Kohavi and John, 1997; Blum and Langley, 1997). An efficient implementation of the P-SVM using a modified sequential minimal optimization procedure for learning is described in (Hochreiter and Obermayer, 2004a).

2 The Potential Support Vector Machine

2.1 A Scale-Invariant Objective Function

Consider a set {x^i | 1 ≤ i ≤ L} of objects which are described by feature vectors x_φ^i ∈ R^N and which form a training set X_φ = {x_φ^1, ..., x_φ^L}. The index “φ” is introduced because we will later use the kernel trick and assume that the vectors x_φ^i are images of a map φ which is induced either by a kernel or by a measurement function. Assume for the moment a simple binary classification problem, where class membership is indicated by labels y_i ∈ {+1, −1}, ⟨·,·⟩ denotes the dot product, and a linear classifier parameterized by the weight vector w and the offset b is chosen from the set

   { sign(f(x_φ)) | f(x_φ) = ⟨w, x_φ⟩ + b } .   (1)

Standard SVM-techniques select the classifier with the largest margin under the constraint of correct classification on the training set:

   min_{w,b}  (1/2) ‖w‖²                         (2)
   s.t.  y_i (⟨w, x_φ^i⟩ + b) ≥ 1 .

If the training data are not linearly separable, a large margin is traded against a small training error using a suitable regularization scheme. The maximum margin objective is motivated by bounds on the generalization error using the Vapnik-Chervonenkis (VC) dimension h as capacity measure (Vapnik, 1998). For the set of all linear classifiers defined on X_φ for which γ ≥ γ_min holds, one obtains h ≤ min([R²/γ_min²], N) + 1 (see Vapnik, 1998; Schölkopf and Smola, 2002). [·] denotes the integer part, γ is the margin, and R is the radius of the smallest sphere in data space which contains all the training data. Bounds derived using the fat shattering dimension (Shawe-Taylor et al., 1996, 1998; Schölkopf and Smola, 2002) and bounds on the expected generalization error (cf. Vapnik, 1998; Schölkopf and Smola, 2002) depend on R/γ_min in a similar manner. Both the selection of a classifier using the maximum margin principle and the values obtained for the generalization error bounds suffer from the problem that

they are not invariant under linear transformations. This problem is illustrated

Figure 2: LEFT: data points from two classes (triangles and circles) are separated by the hyperplane with the largest margin (solid line). The two support vectors (black symbols) are separated by d_x along the horizontal and by d_y along the vertical axis, 4γ² = d_x² + d_y² and R²/(4γ²) = R²/(d_x² + d_y²). The dashed line indicates the classification boundary of the classifier shown on the right, scaled along the vertical axis by the factor 1/s. RIGHT: the same data but scaled along the vertical axis by the factor s. The solid line denotes the maximum margin hyperplane, 4γ̃² = d_x² + s² d_y² and R̃²/(4γ̃²) = R̃²/(d_x² + s² d_y²). For d_y ≠ 0 both the margin and the term R²/γ² depend on s.

in Fig. 2 for a 2D classification problem. In the left figure, both classes are separated by the hyperplane with the largest margin (solid line). In the right figure, the same classification problem is shown, but scaled along the vertical axis by a factor s. Again, the solid line denotes the support vector solution, but when the classifier is scaled back to s = 1 (dashed line in the left figure) it no longer coincides with the original SVM solution. The ratio R²/γ², which bounds the VC dimension, also depends on the scale factor s (see legend of Fig. 2). The situation depicted in Fig. 2 occurs whenever the data can be enclosed by a (non-spherical) ellipsoid (cf. Schölkopf et al., 1999).

The considerations of Fig. 2 show that the optimal hyperplane is not scale invariant and predictions of class labels may change if the data is rescaled before learning. Thus the question arises which scale factors should be used for classifier selection. Here we suggest to scale the training data such that the margin remains constant while the radius R of the sphere containing all training data becomes as small as possible. The result is a new sphere with radius R̃ which leads to a tighter margin-based bound for the generalization error. Optimality is achieved when all directions orthogonal to the normal vector w of the hyperplane with maximal margin are scaled to zero and

   R̃ = min_{t∈R} max_i |⟨ŵ, x_φ^i⟩ + t|  ≤  max_i |⟨ŵ, x_φ^i⟩| ,   where ŵ := w/‖w‖ .

If t is small compared to |⟨ŵ, x_φ^i⟩|, e.g. if the data is centered at the origin, t can be neglected through the above inequality. Unfortunately, the above formulation does not lead to an optimization problem which is easy to implement. Therefore, we suggest to minimize the upper bound:

   R̃²/γ² = R̃² ‖w‖²  ≤  max_i ⟨w, x_φ^i⟩²  ≤  Σ_i ⟨w, x_φ^i⟩²  =  ‖X_φ^T w‖² ,   (3)

where the matrix X_φ := (x_φ^1, x_φ^2, ..., x_φ^L) contains the training vectors x_φ^i. In the second inequality the maximum norm is bounded by the Euclidean norm. Its worst case factor is L, but the bound is tight. The new objective is well defined also for cases where X_φ^T X_φ and/or X_φ X_φ^T is singular, and the kernel trick carries over to the new technique. It can be shown that replacing the objective function ‖w‖² (eq. (2)) by the upper bound

   w^T X_φ X_φ^T w = ‖X_φ^T w‖²   (4)

on R̃²/γ², eq. (3), corresponds to the integration of sphering (whitening) and SVM learning if the data have zero mean. If the data has already been sphered, then the covariance matrix is given by X_φ X_φ^T = I and we recover the classical SVM. If not, minimizing the new objective leads to normal vectors which are rotated towards directions of low variance of the data when compared with the standard maximum margin solution.

Note, however, that whitening can easily be performed in input space but becomes nontrivial if the data is mapped to a high-dimensional feature space

using a kernel function. In order to test whether the situation depicted in Fig. 2 actually occurs in practice, we applied a C-SVM with an RBF-kernel to the UCI data set “breast-cancer” from Subsection 3.1.1. The kernel width σ was chosen small and C large enough, so that no error occurred on the training set (to be consistent with Fig. 2). Sphering was performed in feature space by replacing the objective function in eq. (2) with the new objective, eq. (4). Fig. 3 shows the angle

   ψ = arccos( ⟨w_svm, w_sph⟩ / ( √⟨w_svm, w_svm⟩ √⟨w_sph, w_sph⟩ ) )                                (5)
     = arccos( Σ_{i,j} α_i^svm α_j^sph y_i k(x^i, x^j)
               / ( √( Σ_{i,j} α_i^svm α_j^svm y_i y_j k(x^i, x^j) ) √( Σ_{i,j} α_i^sph α_j^sph k(x^i, x^j) ) ) )

between the hyperplane for the sphered (“sph”) and the non-sphered (“svm”) case as a function of the width σ of the RBF-kernel. The observed values range between 6° and 50°, providing evidence for the situation shown in Fig. 2.

Figure 3: Application of a C-SVM to the data set “breast-cancer” of the UCI repository. The angle ψ (eq. (5)) between the weight vectors w_svm and w_sph is plotted in degrees as a function of the kernel width σ (C = 1000). For σ → 0 the data is already approximately sphered in feature space, hence the objective functions from eqs. (2) and (4) lead to similar results.
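The scale dependence illustrated in Fig. 2 can also be reproduced numerically. The sketch below is our own illustration (the data, the scale factor s, and the use of scikit-learn's SVC as a stand-in hard-margin SVM are assumptions): it rescales one coordinate, refits the maximum-margin hyperplane, and transforms it back; the resulting normal vector generally differs from the one obtained on the unscaled data.

```python
# Rescaling one coordinate, fitting a maximum-margin classifier, and scaling
# the hyperplane back does not reproduce the original hyperplane.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2)) * np.array([1.0, 0.2])   # elongated cloud
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)      # separable labels

def hyperplane(data):
    clf = SVC(kernel="linear", C=1e6)                 # ~hard margin
    clf.fit(data, y)
    w = clf.coef_[0]
    return w / np.linalg.norm(w)

s = 5.0                                               # scale factor along axis 2
w_orig = hyperplane(X)
w_scaled = hyperplane(X * np.array([1.0, s]))
w_back = w_scaled * np.array([1.0, s])                # map hyperplane back
w_back /= np.linalg.norm(w_back)

print("original normal:   ", w_orig)
print("rescaled-and-back: ", w_back)                  # generally differs
```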

The new objective function, eq. (4), leads to separating hyperplanes which are invariant under linear transformations of the data. As a consequence, neither the bounds nor the performance of the derived classifier depends on how the training data was scaled. But is the new objective function also related to a capacity measure for the model class, as the margin is? It is, and in (Hochreiter and Obermayer, 2004c) it has been shown that the capacity measure, eq. (4), emerges when a bound for the generalization error is constructed using the technique of covering numbers.
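The whitening interpretation of eqs. (3) and (4) can be checked numerically: for zero-mean data, ‖X_φ^T w‖² equals, up to the factor L, the squared norm of the corresponding weight vector in sphered coordinates. The following sketch (our own, with random data) verifies this identity.

```python
# Numerical check: ||X^T w||^2 / L equals ||Sigma^{1/2} w||^2, i.e. the
# classical SVM objective after sphering, where Sigma is the data covariance.
import numpy as np

rng = np.random.default_rng(1)
N, L = 3, 200
X = rng.normal(size=(N, L)) * np.array([[3.0], [1.0], [0.2]])  # anisotropic
X -= X.mean(axis=1, keepdims=True)                             # zero mean

Sigma = X @ X.T / L                                            # covariance
evals, evecs = np.linalg.eigh(Sigma)
Sigma_sqrt = evecs @ np.diag(np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T

w = rng.normal(size=N)
v = Sigma_sqrt @ w                      # same classifier in whitened coordinates

print(np.linalg.norm(X.T @ w) ** 2 / L)  # objective (4) per data point
print(np.linalg.norm(v) ** 2)            # classical SVM objective after sphering
```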

2.2 Constraints for Complex Features

The next step is to formulate a set of constraints which enforce a good performance on the training set and which implement the new idea that “row” and “column” objects are both mapped into a Hilbert space within which matrix entries correspond to scalar products and the classification or regression function is constructed.

Assume that each matrix entry was measured by a device which determines the projection of an object feature vector x_φ onto a direction z_ω in feature space. The value K_ij of such a complex feature z_ω^j for an object x_φ^i is then given by the dot product

   K_ij = ⟨x_φ^i, z_ω^j⟩ .   (6)

In analogy to the index “φ” for x_φ, the index “ω” indicates that we will later assume that the vectors z_ω^j are images in R^N of a map ω which is induced by either a kernel or a measurement function. A mathematical foundation of the ansatz eq. (6) is given in Appendix A. Assume that we have P devices whose corresponding vectors z_ω^j are the only measurable and accessible directions, and let Z_ω := (z_ω^1, z_ω^2, ..., z_ω^P) be the matrix of all complex features. Then we can summarize our (incomplete) knowledge about the objects in X_φ using the data matrix

   K = X_φ^T Z_ω .   (7)

For DNA microarray data, for example, we could identify K with the matrix of expression values obtained by a microarray experiment. For web data we could identify K with the matrix of ingoing or outgoing hyperlinks. For a document data set we could identify K with the matrix of word frequencies. Hence we assume that x_φ and z_ω live in a space of hidden causes which are responsible for the different attributes of the objects. The complex features {z_ω^j} span a subspace of the original feature space, but we do not require them to be orthogonal, normalized, or linearly independent. If we set z_ω^j = e^j (the jth Cartesian unit vector), that is Z_ω = I, K_ij = x_φ,j^i and P = N, the “new” description, eq. (7), is fully equivalent to the “old” description using the original feature vectors x_φ.

We now define the quality measure for the performance of the classifier or the regression function on the training set. We consider the quadratic loss function

   c(y_i, f(x_φ^i)) = (1/2) r_i² ,   r_i = f(x_φ^i) − y_i = ⟨w, x_φ^i⟩ + b − y_i .   (8)

The mean squared error is

   R_emp[f_{w,b}] = (1/L) Σ_{i=1}^L c(y_i, f(x_φ^i)) .   (9)

We now require that the selected classification or regression function minimizes R_emp, i.e. that

   ∇_w R_emp[f_{w,b}] = (1/L) X_φ (X_φ^T w + b 1 − y) = 0   (10)

and

   ∂R_emp[f]/∂b = (1/L) Σ_i r_i = b + (1/L) Σ_i (⟨w, x_φ^i⟩ − y_i) = 0 ,   (11)

where the labels for all objects in the training set are summarized by a label vector y. Since the quadratic loss function is convex in w and b, only one minimum exists if X_φ X_φ^T has full rank. If X_φ X_φ^T is singular, then all points of minimal value correspond to a subspace of R^N. From eq. (11) we obtain

   b = −(1/L) Σ_{i=1}^L (⟨w, x_φ^i⟩ − y_i) = −(1/L) (w^T X_φ − y^T) 1 .   (12)

Condition eq. (10) implies that the directional derivative should be zero along any direction in feature space, including the directions of the complex feature vectors z_ω^j. We, therefore, obtain

   dR_emp[f_{w + t z_ω^j, b}]/dt = (z_ω^j)^T ∇_w R_emp[f_{w,b}]                    (13)
                                 = (1/L) (z_ω^j)^T X_φ (X_φ^T w + b 1 − y) = 0 ,

and, combining all complex features,

   (1/L) Z_ω^T X_φ (X_φ^T w + b 1 − y) = (1/L) K^T r = ρ = 0 .   (14)

Hence we require that for every complex feature z_ω^j the mixed moment ρ_j between the residual error r_i and the measured values K_ij should be zero.
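The quantities appearing in eqs. (10)-(14) are straightforward to compute. The following sketch is our own illustration (random data, arbitrary names X_phi and Z_omega): it evaluates the offset of eq. (12), the residuals, and the vector of mixed moments that the P-SVM constraints require to vanish.

```python
# Compute the mixed moments of eq. (14) for a candidate weight vector.
import numpy as np

rng = np.random.default_rng(2)
N, L, P = 5, 30, 8
X_phi = rng.normal(size=(N, L))          # column objects (features x data points)
Z_omega = rng.normal(size=(N, P))        # complex features ("row" objects)
K = X_phi.T @ Z_omega                    # eq. (7), L x P data matrix
y = rng.choice([-1.0, 1.0], size=L)

w = rng.normal(size=N)                   # some candidate weight vector
b = -np.mean(X_phi.T @ w - y)            # eq. (12) for the optimal offset
r = X_phi.T @ w + b - y                  # residuals
rho = K.T @ r / L                        # mixed moments, eq. (14)

print(rho)   # the P-SVM constraints demand rho = 0 (up to slack / epsilon)
```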

2.3 The Potential Support Vector Machine (P-SVM)

2.3.1 The Basic P-SVM

The new objective from eq. (4) and the new constraints from eq. (14) constitute a new procedure for selecting a classifier or a regression function. The number of constraints is equal to the number P of complex features, which can be larger or smaller than the number L of data points or the dimension N of the original feature space. Because the mean squared error of a linear function f_{w,b} is a convex function of the parameters w and b, the constraints enforce that R_emp is minimal². If K has at least rank L (the number of training examples), then the minimum even corresponds to r = 0 (cf. eq. (14)). Consequently, overfitting occurs and a regularization scheme is needed.

Before a regularization scheme can be defined, however, the mixed moments ρ_j must be normalized. The reason is that high values of ρ_j may either be the result of a high variance of the values of K_ij or the result of a high correlation between the residual error r_i and the values of K_ij. Since we are interested in the latter, the most appropriate measure would be Pearson's correlation coefficient

   ρ̂_j = Σ_{i=1}^L (K_ij − K̄_j)(r_i − r̄) / ( √( Σ_{i=1}^L (K_ij − K̄_j)² ) √( Σ_{i=1}^L (r_i − r̄)² ) ) ,   (15)

² Note that w = (X_φ^T)^† (y − b 1) fulfills the constraints, where (X_φ^T)^† is the pseudo-inverse of X_φ^T.

where r̄ = (1/L) Σ_{i=1}^L r_i and K̄_j = (1/L) Σ_{i=1}^L K_ij. If every column vector (K_1j, K_2j, ..., K_Lj) of the data matrix K is normalized to mean zero and variance one, we obtain

   ρ_j = (1/L) Σ_{i=1}^L K_ij r_i = ρ̂_j (1/√L) ‖r − r̄ 1‖_2 .   (16)

Because r̄ = 0 (eq. (11)), the mixed moments are now proportional to the correlation coefficient ρ̂_j with a proportionality constant which is independent of the complex feature z_ω^j, and ρ_j can be used instead of ρ̂_j to formulate the constraints. If the data vectors are normalized, the term K^T 1 vanishes and we obtain the basic P-SVM optimization problem

Basic P-SVM optimization problem

   min_w  (1/2) ‖X_φ^T w‖²                       (17)
   s.t.   K^T (X_φ^T w − y) = 0 .

The offset b of the classification or regression function is given by eq. (12), which simplifies after normalization to (see Hochreiter and Obermayer, 2004a)

   b = (1/L) Σ_{i=1}^L y_i .   (18)

We will call this model selection procedure the Potential Support Vector Machine (P-SVM), and we will always assume normalized data vectors in the following.
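A minimal sketch of the preprocessing assumed by the basic P-SVM problem: every column of K is normalized to zero mean and unit variance, and the offset of eq. (18) is the label mean. The helper below is our own illustration, not the authors' implementation.

```python
# Column-wise normalization of K and the offset b of eq. (18).
import numpy as np

def normalize_columns(K):
    """Normalize every column of K to mean zero and variance one."""
    K = K - K.mean(axis=0, keepdims=True)
    std = K.std(axis=0, keepdims=True)
    std[std == 0.0] = 1.0            # guard against constant columns
    return K / std

rng = np.random.default_rng(3)
K = rng.normal(size=(30, 8)) * 5.0 + 2.0
y = rng.choice([-1.0, 1.0], size=30)

K_norm = normalize_columns(K)
b = y.mean()                         # eq. (18)
print(K_norm.mean(axis=0).round(6), K_norm.std(axis=0).round(6), b)
```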

2.3.2 The Kernel Trick

Following the standard “support vector” way to derive learning rules for perceptrons, we have so far considered linear classifiers only. Most practical classification problems, however, require non-linear classification boundaries, which makes the construction of a proper feature space necessary. In analogy to the standard SVM, we now invoke the kernel trick. Let x^i and z^j be feature vectors which describe the “column” and the “row” objects of the dataset. We then choose a kernel function k(x^i, z^j) and compute the matrix K of relations between “column” and “row” objects:

   K_ij = k(x^i, z^j) = ⟨φ(x^i), ω(z^j)⟩ = ⟨x_φ^i, z_ω^j⟩ ,   (19)

where x_φ^i = φ(x^i) and z_ω^j = ω(z^j). In Appendix A it is shown that any L₂-kernel corresponds (for almost all points) to a dot product in a Hilbert space in the sense of eq. (19) and induces a mapping into a feature space within

which a linear classifier can then be constructed. In the following chapters we will, therefore, distinguish between the actual measurements x^i and z^j and the feature vectors x_φ^i and z_ω^j “induced” by the kernel k. Potential choices for “row” objects and their vectorial description are z^j = x^j, P = L (standard construction of a Gram matrix), z^j = e^j, P = N (“elementary” features), z^j being the jth cluster center of a clustering algorithm applied to the vectors x^i (example for a “complex” feature), or z^j being the jth vector of a PCA or ICA preprocessing (another example for a “complex” feature).

If the entries K_ij of the data matrix are directly measured, the application of the kernel trick needs additional considerations. In Appendix A we show that - if the measurement process can be expressed through a kernel k(x^i, z^j) which takes a column object x^i and a row object z^j and outputs a number - the matrix K of relations between the “row” and “column” objects can be interpreted as a dot product

   K_ij = ⟨x_φ^i, z_ω^j⟩   (20)

in some feature space, where x_φ^i = φ(x^i) and z_ω^j = ω(z^j). Note that we distinguish between an object x^i and its associated feature vectors x^i or x_φ^i, leading to differences in the definition of k for the cases of vectorial and (measured) dyadic data. Eq. (20) justifies the P-SVM approach, which was derived for the case of linear predictors, also for measured data.
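One of the choices listed above, taking cluster centers as “row” objects, can be sketched as follows; the kernel, the centers, and all data are illustrative assumptions.

```python
# Build the rectangular L x P data matrix of eq. (19) from "column" objects
# and a handful of "row" objects (here: fixed centers).
import numpy as np

def rbf(x, z, sigma=0.5):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 2))                         # column objects
Z = np.array([[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]])  # row objects (e.g. centers)

K = np.array([[rbf(x, z) for z in Z] for x in X])     # rectangular, 100 x 3
print(K.shape)      # K need not be square nor positive semidefinite
```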

2.3.3 The P-SVM for Classification

If the P-SVM is used for classification, we suggest a regularization scheme based on slack variables ξ⁺ and ξ⁻. Slack variables allow for small violations of individual constraints if the correct choice of w would otherwise lead to a large increase of the objective function. We obtain

   min_{w, ξ⁺, ξ⁻}  (1/2) ‖X_φ^T w‖² + C 1^T (ξ⁺ + ξ⁻)       (21)
   s.t.   K^T (X_φ^T w − y) + ξ⁺ ≥ 0
          K^T (X_φ^T w − y) − ξ⁻ ≤ 0
          0 ≤ ξ⁺ , ξ⁻

for the primal problem. The above regularization scheme makes the optimization problem robust against “outliers”. A large value of a slack variable indicates that the particular “row” object only weakly influences the direction of the classification boundary, because it would otherwise considerably increase the value of the complexity term. This happens in particular for high levels of measurement noise, which leads to large, spurious values of the mixed moments ρ_j. If the noise is large, the value of C must be small to “remove” the corresponding constraints via the slack variables ξ. If the strength of the measurement noise is known, the correct value of C can be determined a priori. Otherwise, it must be determined using model selection techniques.

To derive the dual optimization problem, we evaluate the Lagrangian L,

   L = (1/2) w^T X_φ X_φ^T w + C 1^T (ξ⁺ + ξ⁻)                                 (22)
       − (α⁺)^T ( K^T (X_φ^T w − y) + ξ⁺ )
       + (α⁻)^T ( K^T (X_φ^T w − y) − ξ⁻ )
       − (μ⁺)^T ξ⁺ − (μ⁻)^T ξ⁻ ,                                               (23)

where the vectors α⁺ ≥ 0, α⁻ ≥ 0, μ⁺ ≥ 0, and μ⁻ ≥ 0 are the Lagrange multipliers for the constraints in eqs. (21). The optimality conditions require that

   ∇_w L = X_φ X_φ^T w − X_φ K α                                               (24)
         = X_φ X_φ^T w − X_φ X_φ^T Z_ω α = 0 ,

where α = α⁺ − α⁻. Eq. (24) is fulfilled if

   w = Σ_{j=1}^P α_j z_ω^j = Z_ω α .   (25)

In contrast to the standard SVM expansion of w into its support vectors x_φ, the weight vector w is now expanded into a set of complex features z_ω, which we will call “support features” in the following. We then arrive at the dual optimization problem:

P-SVM optimization problem for classification

   min_α  (1/2) α^T K^T K α − y^T K α          (26)
   s.t.   −C 1 ≤ α ≤ C 1 ,

which now depends on the data via the kernel or data matrix K only. The dual problem can be solved by a Sequential Minimal Optimization (SMO) technique which is described in (Hochreiter and Obermayer, 2004a). The SMO technique is essential if many complex features are used, because the P × P matrix K^T K enters the dual formulation. Finally, the classification function f has to be constructed using the optimal values of the Lagrange parameters α:

P-SVM classification function

   f(x) = Σ_{j=1}^P α_j K(x)_j + b ,   (27)

where the expansion eq. (25) has been used for the weight vector w and b is given by eq. (18).

The classifier based on eq. (27) depends on the coefficients α_j, which were determined during optimization, on b, which can be computed directly, and on the measured values K(x)_j for the new object x. The coefficients α_j = α_j⁺ − α_j⁻ can be interpreted as class indicators, because they separate the complex features into features which are relevant for class 1 and class −1, according to the sign of α_j. Note that if we consider the Lagrange parameters α_j as parameters of the classifier, we find that

   dR_emp[f_{w + t z_ω^j, b}]/dt = ρ_j = ∂R_emp[f]/∂α_j .

The directional derivatives of the empirical error R_emp along the complex features in the primal formulation correspond to its partial derivatives with respect to the corresponding Lagrange parameter in the dual formulation.

One of the most crucial properties of the P-SVM procedure is that the dual optimization problem depends on K only via K^T K. Therefore, K is neither required to be positive semidefinite nor to be square. This allows not only the construction of SVM-based classifiers for matrices K of general shape but also the extension of SVM-based approaches to the new class of indefinite kernels operating on the objects' feature vectors.

In the following we illustrate the application of the P-SVM approach to classification using a toy example. The data set consists of 70 objects x, 28 from class 1 and 42 from class 2, which are described by 2D feature vectors x (open and solid circles in Fig. 4). Two pairwise data sets were then generated by applying a standard RBF-kernel and the (indefinite) sine-kernel k(x^i, x^j) = sin(θ ‖x^i − x^j‖), which leads to an indefinite Gram matrix. Fig. 4 shows the classification result. The sine-kernel is more appropriate than the RBF-kernel for this data set because it is better adjusted to the “oscillatory” regions of class membership, leading to a smaller number of support vectors and to a smaller test error. In general, the value of θ has to be selected using standard model selection techniques. A large value of θ leads to a more “complex” set of classifiers, reduces the classification error on the training set, but increases the error on the test set. Non-Mercer kernels extend the range of kernels which are currently used and, therefore, open up a new direction of research for kernel design.
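To make the overall procedure concrete, the following sketch solves the box-constrained dual (26) for a small toy problem with a generic bounded optimizer and classifies new points with f(x) = Σ_j α_j K(x)_j + b. It is our own illustrative implementation; the paper itself uses a dedicated SMO solver (Hochreiter and Obermayer, 2004a), and the data, kernel, and value of C below are made up.

```python
# End-to-end P-SVM classification sketch: rectangular K, dual (26), prediction.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)

# toy "column" objects in two classes and a handful of "row" objects
X = np.vstack([rng.normal(-1.0, 0.5, size=(40, 2)),
               rng.normal(+1.0, 0.5, size=(40, 2))])
y = np.hstack([-np.ones(40), np.ones(40)])
Z = rng.uniform(-2.0, 2.0, size=(10, 2))

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2, axis=-1) / (2.0 * sigma ** 2))

K = rbf(X[:, None, :], Z[None, :, :])          # L x P kernel matrix

# normalize columns (training statistics are reused at prediction time)
mu, sd = K.mean(axis=0), K.std(axis=0)
Kn = (K - mu) / sd

C = 10.0
Q = Kn.T @ Kn
c = Kn.T @ y

obj = lambda a: 0.5 * a @ Q @ a - c @ a        # dual objective, eq. (26)
grad = lambda a: Q @ a - c
res = minimize(obj, np.zeros(Z.shape[0]), jac=grad,
               method="L-BFGS-B", bounds=[(-C, C)] * Z.shape[0])
alpha, b = res.x, y.mean()

def predict(x_new):
    k = (rbf(x_new[None, :], Z) - mu) / sd     # kernel values to all row objects
    return np.sign(k @ alpha + b)

print(predict(np.array([-1.0, -1.0])), predict(np.array([1.0, 1.0])))
```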

2.3.4 The P-SVM for Regression

The new objective function of eq. (4) was motivated for a classification problem, but it can also be used to find an optimal regression function in a regression task. In regression, the term ‖X_φ^T w‖² = ‖X_φ^T ŵ‖² ‖w‖², ŵ := w/‖w‖, is the product of a term which expresses the deviation of the data from the zero-isosurface of the regression function and a term which corresponds to the smoothness of the regressor. The smoothness of the regression function is determined by the norm of the weight vector w. If f(x_φ^i) = ⟨w, x_φ^i⟩ + b and the

[Figure 4, two panels; left: “kernel: RBF, SV: 50, σ: 0.1, C: 100”, right: “kernel: sine, SV: 7, θ: 25, C: 100”.]

Figure 4: Application of the P-SVM method to a toy classification problem. Objects are described by 2D feature vectors x. 70 feature vectors were generated, 28 for class 1 (open circles) and 42 for class 2 (solid circles). A Gram matrix was constructed using the kernels “RBF”, k(x^i, x^j) = exp(−(1/(2σ²)) ‖x^i − x^j‖²) (left), and “sine”, k(x^i, x^j) = sin(θ ‖x^i − x^j‖) (right). White and gray indicate regions of class 1 and class 2 membership. Circled data indicate support vectors (“SV”). For parameters see figure.

length of the vectors x_φ is bounded by B, then the deviation of f from the offset b is bounded by

   |f(x_φ^i) − b| = |⟨w, x_φ^i⟩| ≤ ‖w‖ ‖x_φ^i‖ ≤ ‖w‖ B ,   (28)

where the first “≤” follows from the Cauchy-Schwarz inequality; hence a smaller value of ‖w‖ leads to a smoother regression function. This tradeoff between smoothness and residual error is also reflected by eq. (59) in Appendix A, which shows that eq. (4) is the L₂-norm of the function f. The constraints of vanishing mixed moments carry over to regression problems with the only modification that the target values y_i in eqs. (21) are real rather than binary numbers. The constraints are even more “natural” for regression because the r_i are indeed the residuals a regression function should minimize. We, therefore, propose to use the primal optimization problem, eqs. (21), and its corresponding dual, eqs. (26), also for the regression setting.

Fig. 5 shows the application of the P-SVM to a toy regression example (pairwise data). 50 data points are randomly chosen from the true function (dashed line) and i.i.d. Gaussian noise with mean 0 and standard deviation 0.2 is added to each y-component. One outlier was added at x = 0. The figure shows the P-SVM regression results (solid lines) for an RBF-kernel and three different combinations of C and σ. The hyperparameter C controls the residual error: a smaller value of C increases the residual error at the outlier at x = 0 but also the number of support vectors. The width σ of the RBF-kernel controls the overall smoothness of the regressor: a larger value of σ increases the error at x = −2 without increasing the number of support vectors (bold arrows in Fig. 5).

[Figure 5, three panels; top left: “SV: 7, σ: 2, C: 20”, top right: “SV: 5, σ: 4, C: 20”, bottom: “SV: 14, σ: 2, C: 0.6”; each panel marks x = −2.]

Figure 5: Application of the P-SVM method to a toy regression problem. Objects (small dots), described by the x-coordinate, were generated by randomly choosing a value from [−10, 10] and calculating the y-value from the true function (dashed line) plus Gaussian noise N(0, 0.2). One outlier was added at x = 0 (thin arrows). A Gram matrix was generated using an RBF-kernel k(x^i, x^j) = exp(−(1/(2σ²)) ‖x^i − x^j‖²). The solid lines show the regression result. Circled dots indicate support vectors (“SV”). For parameters see figure. Bold arrows mark x = −2.

2.3.5 The P-SVM for Feature Selection

In this section we modify the P-SVM method for feature selection such that it can serve as a data preprocessing method in order to improve the generalization performance of subsequent classification or regression tasks (see also Hochreiter and Obermayer, 2004b). Due to the property of the P-SVM method to expand w into a sparse set of support features, it can be modified to optimally extract a small number of “informative” features from the set of “row” objects. The set of “support features” can then be used as input to an arbitrary predictor, e.g. another P-SVM, a standard SVM, or a K-nearest-neighbor classifier.

Noisy measurements can lead to spurious mixed moments, i.e. complex features may contain no information about the objects' attributes but still exhibit finite values of ρ_j. In order to prevent those features from affecting the classification

boundary or the regression function, we introduce a “correlation threshold” ε and modify the constraints in eqs. (17) according to

   K^T (X_φ^T w − y) − ε 1 ≤ 0 ,                      (29)
   K^T (X_φ^T w − y) + ε 1 ≥ 0 .

This regularization scheme is analogous to the ε-insensitive loss (Schölkopf and Smola, 2002). Absolute values of mixed moments smaller than ε are considered to be spurious; the corresponding features do not influence the weight vector, because the constraints remain fulfilled. The value of ε directly correlates with the strength of the measurement noise and can be determined a priori if it is known. If this is not the case, ε serves as a hyperparameter and its value can be determined using model selection techniques. Note that data vectors have to be normalized (see Subsection 2.3.1) before applying the P-SVM, because otherwise a global value of ε would not suffice. Combining eq. (4) and eqs. (29) we then obtain the primal optimization problem

   min_w  (1/2) ‖X_φ^T w‖²                             (30)
   s.t.   K^T (X_φ^T w − y) + ε 1 ≥ 0
          K^T (X_φ^T w − y) − ε 1 ≤ 0

for P-SVM feature selection. In order to derive the dual formulation we have to evaluate the Lagrangian:

   L = (1/2) w^T X_φ X_φ^T w − (α⁺)^T ( K^T (X_φ^T w − y) + ε 1 )              (31)
       + (α⁻)^T ( K^T (X_φ^T w − y) − ε 1 ) ,

where we have used the notation from Section 2.3.3. The vector w is again expressed through the complex features,

   w = Z_ω α ,   (32)

and we obtain the dual formulation of eq. (30):

P-SVM optimization problem for feature selection

   min_{α⁺, α⁻}  (1/2) (α⁺ − α⁻)^T K^T K (α⁺ − α⁻)         (33)
                 − y^T K (α⁺ − α⁻) + ε 1^T (α⁺ + α⁻)
   s.t.   0 ≤ α⁺ ,  0 ≤ α⁻ .

The term ε 1^T (α⁺ + α⁻) in this dual objective function enforces a sparse expansion of the weight vector w in terms of the support features. This occurs,

because for large enough values of ε, this term forces all α_j towards zero except for the complex features which are most relevant for classification or regression. If K^T K is singular and w is not uniquely determined, ε enforces a unique solution, which is characterized by the most sparse representation through complex features. The dual problem is again solved by a fast Sequential Minimal Optimization (SMO) technique (see Hochreiter and Obermayer, 2004a).

Finally, let us address the relationship between the value of a Lagrange multiplier α_j and the “importance” of the corresponding complex feature z_ω^j for prediction. The change of the empirical error under a change of the weight vector by an amount τ along the direction of a complex feature z_ω^j is given by

   R_emp[f_{w + τ z_ω^j, b}] − R_emp[f_{w,b}]                                   (34)
      = τ ρ_j + (τ²/(2L)) Σ_i K_ij²  =  τ ρ_j + τ²/2  ≤  |τ| ε/L + τ²/2 ,

because the constraints eq. (29) ensure that |ρ_j| ≤ ε/L. If a complex feature z_ω^j is completely removed, then τ = −α_j and

   R_emp[f_{w − α_j z_ω^j, b}] − R_emp[f_{w,b}]  ≤  |α_j| ε/L + α_j²/2 .   (35)

The Lagrange parameter α_j is directly related to the increase in the empirical error when a feature is removed. Therefore, the vector α serves as an importance measure for the complex features and allows one to rank features according to the absolute values of its components.

In the following, we illustrate the application of the P-SVM approach to feature selection using a classification toy example. The data set consists of 50 “column” objects x, 25 from each class, which are described by 2D feature vectors x (open and solid circles in Fig. 6). 50 “row” objects z were randomly selected by choosing their 2D feature vectors z according to a uniform distribution on [−1.2, 1.2] × [−1.2, 1.2]. K was generated using an RBF kernel k(x^i, z^j) = exp(−(1/(2σ²)) ‖x^i − z^j‖²) with σ = 0.2. Fig. 6 shows the result of the P-SVM feature selection method with a correlation threshold ε = 20. Only six features were selected (crosses in Fig. 6), but every group of data points (“column” objects) is described (and detected) by one or two feature vectors (“row” objects). The number of selected features depends on ε, and on σ, which determines how the “strength” of a complex feature decreases with the distances ‖x^i − z^j‖. Smaller values of ε or larger values of σ would result in a larger number of support features.
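The feature selection dual (33) can likewise be handled with a generic bounded solver for small problems. The sketch below is our own illustration (made-up data and ε, no SMO): it solves for α⁺ and α⁻ and ranks the “row” objects by |α_j|, in line with the importance interpretation of eqs. (34)-(35).

```python
# Solve the epsilon-regularized dual (33) and rank features by |alpha_j|.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
L, P = 60, 12
K = rng.normal(size=(L, P))
beta = np.zeros(P)
beta[[2, 7]] = 2.0                                # only two informative features
y = np.sign(K @ beta + 0.3 * rng.normal(size=L))

K = (K - K.mean(axis=0)) / K.std(axis=0)          # column normalization
Q, c = K.T @ K, K.T @ y
eps = 15.0                                        # correlation threshold

def obj(v):
    ap, am = v[:P], v[P:]
    a = ap - am
    return 0.5 * a @ Q @ a - c @ a + eps * (ap + am).sum()

def grad(v):
    g = Q @ (v[:P] - v[P:]) - c
    return np.hstack([g + eps, -g + eps])

res = minimize(obj, np.zeros(2 * P), jac=grad,
               method="L-BFGS-B", bounds=[(0.0, None)] * (2 * P))
alpha = res.x[:P] - res.x[P:]

print("selected (nonzero alpha):", np.flatnonzero(np.abs(alpha) > 1e-6))
print("ranking by importance:   ", np.argsort(-np.abs(alpha)))
```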

2.4 The Dot Product Interpretation of Dyadic Data

The derivation of the P-SVM method is based on the assumption that the matrix K is a dot product matrix whose elements denote a scalar product between the feature vectors which describe the “row” and the “column” objects. If K, however, is a matrix of measured values, the question arises under which conditions such a matrix can be interpreted as a dot product matrix. In Appendix A we show that the above question reduces to

Figure 6: P-SVM method applied to a toy feature selection problem. “Column” and “row” objects are described by 2D feature vectors x and z. 50 “row” feature vectors, 25 from each class (open / solid circles), were chosen from N((±1, ±1), 0.1 I), where centers are chosen with equal probability. 100 column feature vectors were chosen uniformly from [−1.2, 1.2]². K is constructed by an RBF-kernel exp(−(1/(2 · 0.2²)) ‖x^i − z^j‖²). Black crosses indicate P-SVM support features. (Panel legend: SV: 6, σ: 0.1, ε: 20.)

(1) Can the set X of “column” objects x be completed to a measure space?
(2) Can the set Z of “row” objects z be completed to a measure space?
(3) Can the measurement process be expressed via the evaluation of a measurable kernel k(x, z) which is from L₂(X, Z)?

(1) and (2) can easily be fulfilled by defining a suitable σ-algebra on the sets; (3) holds if k is bounded and the sets X and Z are compact. Condition (3) equates the evaluation of a kernel as known from standard SVMs with physical measurements, and the physical characteristics of the measurement device determine the properties of the kernel, e.g. boundedness and continuity. But can a measurement process indeed be expressed through a kernel? There is no full answer to this question from a theoretical viewpoint; practical applications have to confirm (or disprove) the chosen ansatz and data model.

3 Numerical Experiments and Applications

In this section we apply the P-SVM method to various kinds of real world data sets and provide benchmark results with previously proposed methods when appropriate. This section consists of three parts which cover classification, regression, and feature selection. In part one the P-SVM is first tested as a classifier on data sets from the UCI Benchmark Repository and its performance is compared with results obtained with C- and ν-SVMs for different kernels. Secondly, we apply the P-SVM to a measured (rather than constructed) pairwise data set (“protein”) and a genuine dyadic data set (“World Wide Web”). In part two the P-SVM is applied to regression problems taken from the UCI benchmark repository and compared with results obtained with C-Support Vector Regression and Bayesian SVMs. Part three describes results obtained for the P-SVM as a feature selection method for microarray data and for the “protein” and “World Wide Web” data sets from part one.

3.1 Application to Classification Problems

3.1.1 UCI Data Sets

In this section we report benchmark results for the data sets “thyroid” (5 features), “heart” (13 features), “breast-cancer” (9 features), and “german” (20 features) from the UCI benchmark repository, and for the data set “banana” (2 features) taken from (Rätsch et al., 2001). All data sets were preprocessed as described in (Rätsch et al., 2001) and divided into 100 training/test set pairs. Data sets were generated through resampling, where data points were randomly selected for the training set and the remaining data was used for the test set. We downloaded the original 100 training/test set pairs from http://ida.first.fraunhofer.de/projects/bench/. For the data sets “german” and “banana” we restricted the training set to the first 200 examples of the original training set in order to keep hyperparameter selection feasible, but used the original test sets.

Pairwise datasets were generated by constructing the Gram matrix for radial basis function (RBF), polynomial (POL), and Plummer (PLU, see Hochreiter et al., 2003) kernels, and the Gram matrices were used as input for kernel Fisher discriminant analysis (KFD, Mika et al., 1999), C-, ν-, and P-SVM. Because KFD only selects a direction in input space onto which all data points are projected, we must select a classifier for the resulting one-dimensional classification task. We follow (Schölkopf and Smola, 2002) and classify a data point according to its posterior probability under the assumption of a Gaussian distribution for each label. Hyperparameters (C, ν, and kernel parameters) were optimized using 5-fold cross validation on the corresponding training sets. To ensure a fair comparison, the hyperparameter selection procedure was equal for all methods, except that the values of ν for the ν-SVM were selected from {0.1, ..., 0.9}, in contrast to the selection of C, for which a logarithmic scale was used.

To test the significance of the differences in generalization performance (percent of misclassifications), we calculated for what percentage of the individual runs (for a total of 100, see below) our method was significantly better or worse than others. In order to do so, we first performed a test for the “difference of two proportions” for each training/test set pair (Dietterich, 1998). The “difference of two proportions” is the difference of the misclassification rates of two models on the test set, where the models are selected on the training set by the two methods which are to be compared. After this test we adjusted the false discovery rate through the “Benjamini-Hochberg procedure” (Benjamini and Hochberg, 1995), which was recently shown to be correct for dependent outcomes of the tests (Benjamini and Yekutieli, 2001). The fact that the tests can be dependent is important because training and test sets overlap for the different training/test set pairs. The false discovery rate was controlled at 5 %. We counted for each pair of methods the selected models from the first method which perform significantly (5 % level) better than the selected models from the second method.
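For reference, the Benjamini-Hochberg step-up procedure used above can be sketched in a few lines; the implementation and the p-values are our own illustration (in the paper, the p-values come from the “difference of two proportions” test per training/test pair).

```python
# Benjamini-Hochberg step-up procedure: control the false discovery rate at q.
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of rejected hypotheses (FDR controlled at q)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True           # reject the k smallest p-values
    return rejected

pvals = [0.001, 0.008, 0.039, 0.041, 0.09, 0.20, 0.74]   # made-up example
print(benjamini_hochberg(pvals))
```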

Table 1 summarizes the percentage of misclassification averaged over 100

                   RBF            POL            PLU
Thyroid
  C-SVM        5.11 (2.34)    4.51 (2.02)    4.96 (2.35)
  ν-SVM        5.15 (2.23)    4.04 (2.12)    4.83 (2.03)
  KFD          4.96 (2.24)    6.52 (3.18)    5.00 (2.26)
  P-SVM        4.71 (2.06)    9.44 (3.12)    5.08 (2.18)
Heart
  C-SVM       16.67 (3.51)   18.26 (3.50)   16.33 (3.47)
  ν-SVM       16.87 (3.87)   17.44 (3.90)   18.47 (7.81)
  KFD         17.82 (3.45)   22.53 (3.37)   17.80 (3.86)
  P-SVM       16.18 (3.66)   16.67 (3.40)   16.54 (3.64)
Breast-Cancer
  C-SVM       26.94 (5.18)   26.87 (4.79)   26.48 (4.87)
  ν-SVM       27.69 (5.62)   26.69 (4.73)   30.16 (7.83)
  KFD         27.53 (4.19)   31.30 (6.11)   27.19 (4.72)
  P-SVM       26.78 (4.58)   26.40 (4.54)   26.66 (4.59)
Banana
  C-SVM       11.88 (1.11)   12.09 (0.96)   11.81 (1.07)
  ν-SVM       11.67 (0.90)   12.72 (2.16)   11.74 (0.98)
  KFD         12.32 (1.12)   14.04 (3.89)   12.14 (0.96)
  P-SVM       11.59 (0.96)   14.93 (2.09)   11.52 (0.93)
German
  C-SVM       26.51 (2.60)   27.27 (3.23)   26.88 (3.12)
  ν-SVM       27.14 (2.84)   27.13 (3.06)   28.60 (6.27)
  KFD         26.58 (2.95)   30.96 (3.23)   26.90 (3.15)
  P-SVM       26.45 (3.20)   25.87 (2.45)   26.65 (2.95)

Table 1: Average percentage of misclassification for the UCI and the “banana” data sets. The table compares results obtained with kernel Fisher discriminant analysis (KFD), C-, ν-, and P-SVM for the radial basis function (RBF) kernel, exp(−(1/(2σ²)) ‖x^i − x^j‖²), and the polynomial (POL) and Plummer (PLU, see Hochreiter et al., 2003) kernels. Results were averaged over 100 experiments with separate training and test sets. For each data set, numbers in bold and italic highlight the best and the second best result; the numbers in brackets denote standard deviations of the trials. C, ν, and kernel parameters were determined using 5-fold cross validation on the training set and usually differed between individual experiments.

experiments. Despite the fact that C- and ν-SVMs are equivalent, results differ because of the somewhat different model selection results for the hyperparameters C and ν. Best and second best results are indicated by bold and italic numbers. The significance tests did not reveal significant differences over the 100 runs in generalization performance for the majority of the cases (for

details see http://ni.cs.tu-berlin.de/publications/psvm_sup). The P-SVM with “RBF” and “PLU” kernels, however, never performed significantly worse than their competitors for any of the 100 runs of each data set. Note that the significance tests do not tell whether the averaged misclassification rates of one method are significantly better or worse than the rates of another method. They provide information on how many runs out of 100 one method was significantly better than another method. The UCI benchmark results show that the P-SVM is competitive with other state-of-the-art methods for a standard problem setting (measurement matrix equivalent to the Gram matrix). Although the P-SVM method never performed significantly worse, it generally required fewer support vectors than other SVM approaches. This was also true for the cases where the P-SVM's performance was significantly better.

3.1.2 Protein Data Set

The “protein” data set (cf. Hofmann and Buhmann, 1997) was provided by M. Vingron and consists of 226 proteins from the globin families. Pairs of proteins are characterized by their evolutionary distance, which is defined as the probability of transforming one amino acid sequence into the other via point mutations. Class labels are provided which denote membership in one of the four families: hemoglobin-α (“H-α”), hemoglobin-β (“H-β”), myoglobin (“M”), and heterogeneous globins (“GH”). Table 2 summarizes the classification results, which were obtained with the generalized SVM (G-SVM, Graepel et al., 1999; Mangasarian, 1998) and the P-SVM method. Since the G-SVM interprets the columns of the data matrix as feature vectors for the column objects and applies a standard ν-SVM to these vectors (this is also called “feature map”, Schölkopf and Smola, 2002), we call the G-SVM simply “ν-SVM” in the following. The table shows the percentage of misclassification for the four two-class classification problems “one class against the rest”. The P-SVM yields better classification results, although a conservative test for significance was not possible due to the small number of data. However, the P-SVM selected 180 proteins as support vectors on average, compared to 203 proteins used by the ν-SVM (note that for 10-fold cross validation 203 is the average training set size). Here a smaller number of support vectors is highly desirable, because it reduces the computational costs of sequence alignments which are necessary for the classification of new examples.

3.1.3 World Wide Web Data Set

The “World Wide Web” data sets consist of 8,282 WWW-pages collected during the WebKB project at Carnegie Mellon University in January 1997 from the web sites of the computer science departments of the four institutions Cornell, Texas, Washington, and Wisconsin. The pages were manually classified into the categories “student”, “faculty”, “staff”, “department”, “course”, “project”, and “other”.

protein data
            Reg.    H-α    H-β     M     GH
Size         —       72     72    39     30
ν-SVM       0.05    1.3    4.0   0.5    0.5
ν-SVM       0.1     1.8    4.5   0.5    0.9
ν-SVM       0.2     2.2    8.9   0.5    0.9
P-SVM       300     0.4    3.5   0.0    0.4
P-SVM       400     0.4    3.1   0.0    0.9
P-SVM       500     0.4    3.5   0.0    1.3

Table 2: Percentage of misclassification for the “protein” data set for classifiers obtained with the P- and ν-SVM methods. Column “Reg.” lists the values of the regularization parameter (ν for the ν-SVM and C for the P-SVM). Columns “H-α” to “GH” provide the percentage of misclassification for the four problems “one class against the rest” (“Size” denotes the number of proteins per class), computed using 10-fold cross validation. The best results for each problem are shown in bold. The data matrix was not positive semidefinite.

Every pair (i, j) of pages is characterized by whether page i contains a hyperlink to page j and vice versa. The data is summarized using two binary matrices and a ternary matrix. The first matrix K (“out”) contains a one for at least one outgoing link (i → j) and a zero if no outgoing link exists, the second matrix K^T (“in”) contains a one for at least one ingoing link (j → i) and a zero otherwise, and the third, ternary matrix (1/2)(K + K^T) (“sym”) contains a zero if no link exists, a value of 0.5 if only one unidirectional link exists, and a value of 1 if links exist in both directions. In the following, we restricted the data set to pages from the first six classes which had more than one in- or outgoing link. The data set thus consists of the four subsets “Cornell” (350 pages), “Texas” (286 pages), “Wisconsin” (300 pages), and “Washington” (433 pages).

Table 3 summarizes the classification results for the C- and P-SVM methods. The parameter C for both SVMs was optimized for each cross validation trial using another 4-fold cross validation on the training set. Significance tests were performed to evaluate the differences in generalization performance using the 10-fold cross-validated paired t-test (Dietterich, 1998). We considered 48 tasks (4 universities, 4 classes, 3 matrices) and checked for each task whether the C-SVM or the P-SVM performed better using a p-value of 0.05. In 30 tasks the P-SVM had a significantly better performance than the C-SVM, while the C-SVM was never significantly better than the P-SVM (for details see http://ni.cs.tu-berlin.de/publications/psvm_sup). Classification results are better for the asymmetric matrices “in” and “out” than for the symmetric matrix “sym”, because there are cases for which highly indicative pages (hubs) exist which are connected to one particular class of pages by either in- or outgoing links. The symmetric case blurs the contribution of the indicative pages because ingoing and outgoing links can no longer be distinguished, which leads to poorer performance. Because the P-SVM yields fewer support vectors, online classification is faster than for the C-SVM and - if web pages cease to exist - the P-SVM is less likely to be affected.
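The construction of the three link matrices can be sketched as follows; the pages and links are made up, and the code is only meant to illustrate the “out”, “in”, and “sym” encodings described above.

```python
# Build the "out", "in", and "sym" link matrices from a directed link list.
import numpy as np

pages = ["p0", "p1", "p2", "p3"]
links = [(0, 1), (1, 0), (0, 2), (3, 2)]       # (i, j): page i links to page j

n = len(pages)
K_out = np.zeros((n, n))
for i, j in links:
    K_out[i, j] = 1.0                          # at least one outgoing link i -> j

K_in = K_out.T                                 # ingoing links j -> i
K_sym = 0.5 * (K_out + K_in)                   # 0: none, 0.5: one way, 1: both

print(K_sym)
```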

3.2 Application to Regression Problems

In this section we report results for the data sets “robot arm” (2 features), “boston housing” (13 features), “computer activity” (21 features), and “abalone” (10 features) from the UCI benchmark repository. The data preprocessing is described in (Chu et al., 2004), and the data sets are available as training/test set pairs at http://guppy.mpe.nus.edu.sg/~chuwei/data. The sizes of the data sets were (training set / test set): “robot arm”: 200 / 200, 1 set; “boston housing”: 481 / 25, 100 sets; “computer activity”: 1000 / 6192, 10 sets; “abalone”: 3000 / 1177, 10 sets. Pairwise data sets were generated by constructing the Gram matrices for radial basis function kernels of different widths σ, and the Gram matrices were used as input for the three regression methods: C-support vector regression (SVR, Schölkopf and Smola, 2002), Bayesian support vector regression (BSVR, Chu et al., 2004), and the P-SVM. Hyperparameters (C and σ) were optimized using n-fold cross-validation (n = 50 for “robot arm”, n = 20 for “boston housing”, n = 4 for “computer activity”, and n = 4 for “abalone”). Parameters were first optimized on a coarse 4 × 4 grid and later refined on a 7 × 7 fine grid around the values for C and σ selected in the first step (65 tests per parameter selection).

Table 4 shows the mean squared error and the standard deviation of the results. We also performed a Wilcoxon signed rank test to verify the significance of these results (for details see http://ni.cs.tu-berlin.de/publications/psvm_sup), except for the “robot arm” data set, which has only one training/test set pair, and the “boston housing” data set, which contains too few test examples. On “computer activity” SVR was significantly better (5 % threshold) than BSVR, and on both data sets “computer activity” and “abalone”, SVR and BSVR were significantly outperformed by the P-SVM, which in addition used fewer support vectors than its competitors.

3.3 Application to Feature Selection Problems

In this section we apply the P-SVM to feature selection problems of various kinds, using the “correlation threshold” regularization scheme (Section 2.3.5). We first reanalyze the “protein” and “World Wide Web” data sets of Sections 3.1.2 and 3.1.3 and then report results on three DNA microarray data sets. Further feature selection results can be found in (Hochreiter and Obermayer, 2005), where results for the NIPS 2003 feature selection challenge are also reported and where the P-SVM was the best stand-alone method for selecting a compact feature set, and in (Hochreiter and Obermayer, 2004b), where details of the microarray dataset benchmarks are reported.

                       Course         Faculty        Project        Student
Cornell University
Size                     57             60             52            143
C-SVM (Sym)          11.1 (6.2)     19.7 (5.3)     13.7 (4.0)     50.0 (11.5)
C-SVM (Out)          12.6 (3.1)     15.1 (6.0)     10.6 (4.9)     22.3 (10.3)
C-SVM (In)           11.1 (4.9)     21.4 (4.3)     14.6 (5.5)     48.9 (15.5)
P-SVM (Sym)          12.3 (3.3)     17.1 (6.2)     15.4 (6.3)     19.1 (5.7)
P-SVM (Out)           8.6 (3.8)     14.3 (6.3)      8.3 (4.9)     16.9 (7.9)
P-SVM (In)            7.1 (4.1)     13.7 (6.6)     10.9 (5.5)     17.1 (5.7)
Texas University
Size                     52             35             29            129
C-SVM (Sym)          17.2 (9.0)     22.0 (9.1)     19.8 (6.7)     53.5 (11.8)
C-SVM (Out)           9.5 (5.1)     16.5 (5.8)     20.2 (8.7)     28.9 (11.7)
C-SVM (In)           12.6 (4.9)     20.6 (7.8)     20.9 (5.1)     16.4 (7.6)
P-SVM (Sym)          15.8 (5.8)     13.6 (7.3)     12.2 (6.2)     25.5 (6.9)
P-SVM (Out)           8.1 (7.6)      9.8 (3.6)      9.8 (3.9)     20.9 (6.7)
P-SVM (In)           12.3 (5.6)     10.5 (6.3)      9.4 (4.6)     13.0 (5.0)
Wisconsin University
Size                     77             36             22            117
C-SVM (Sym)          27.0 (10.0)    22.0 (5.5)     14.0 (6.4)     49.3 (11.1)
C-SVM (Out)          19.3 (7.5)     16.0 (3.8)     10.3 (4.8)     34.3 (10.5)
C-SVM (In)           22.0 (8.6)     16.3 (5.8)      7.7 (4.5)     24.3 (9.9)
P-SVM (Sym)          18.7 (4.5)     15.0 (9.3)     10.0 (5.4)     34.3 (8.6)
P-SVM (Out)          12.0 (5.5)     11.3 (6.5)      7.7 (4.2)     23.7 (4.8)
P-SVM (In)           13.3 (4.4)      8.7 (8.2)      6.3 (7.1)     13.3 (5.9)
Washington University
Size                    169             44             39            151
C-SVM (Sym)          19.6 (6.8)     18.7 (6.8)     10.6 (3.5)     43.6 (8.3)
C-SVM (Out)          10.6 (4.6)     14.1 (3.0)     14.3 (4.8)     28.2 (9.8)
C-SVM (In)           20.3 (6.4)     20.4 (5.3)     13.8 (4.7)     38.3 (11.9)
P-SVM (Sym)          17.1 (4.4)     13.4 (6.6)      8.8 (2.1)     20.3 (6.8)
P-SVM (Out)          10.6 (5.2)     12.7 (2.9)      6.7 (3.4)     17.1 (4.3)
P-SVM (In)           11.8 (5.6)      9.2 (6.2)      6.7 (2.0)     14.3 (6.9)

Table 3: Percentage of misclassification for the World Wide Web data sets for classifiers obtained with the P- and C-SVM methods. The percentage of misclassification was measured using 10-fold cross-validation. The best and second best results for each data set and classification task are indicated in bold and italics; numbers in brackets denote standard deviations of the trials.

3.3.1 Protein and World Wide Web Data Sets

In this section we again apply the P-SVM to the “protein” and “world wide web” data sets of Sections 3.1.2 and 3.1.3.

            robot arm (10⁻³)   boston housing   computer activity   abalone
SVR         5.84               10.27 (7.21)     13.80 (0.93)        0.441 (0.021)
BSVR        5.89               12.34 (9.20)     17.59 (0.98)        0.438 (0.023)
P-SVM       5.88                9.42 (4.96)     10.28 (0.47)        0.424 (0.017)

Table 4: Regression results for the UCI data sets. The table shows the mean squared error and its standard deviation in brackets. Best results for each data set are shown in bold. For the “robot arm” data only one data set was available and, therefore, no standard deviation is given.

Using both regularization schemes simultaneously leads to a trade-off between a small number of features (i.e. a small number of measurements) and a better classification result. Reducing the number of features is beneficial if measurements are costly and if a small increase in prediction error can be tolerated. Table 5 shows the results for the “protein” data sets for various values of the regularization parameter ε. C was set to 100, because this value gave good results for a wide range of values of ε. We chose a minimal value of ε = 0.2 because it resulted in a classifier for which all complex features were support vectors. The size of the training set is 203. Note that C is smaller than in the experiments of Section 3.1.2, because large values of ε push the dual variables towards zero, so that large values of C have no influence. The table shows that classification performance drops if fewer features are considered, but that 5 % of the features suffice to obtain a performance of only about 5 % misclassifications, compared to about 2 % at the optimum. Since every feature value has to be determined via a sequence alignment, this saving in computation time is essential for large data bases like the Swiss-Prot data base (130,000 proteins), where supplying all pairwise relations is currently impossible.

protein data

ε       H-α          H-β          M            GH
0.2     1.3 (203)    4.9 (203)    0.9 (203)     1.3 (203)
1       2.6 (41)     5.3 (110)    1.3 (28)      4.4 (41)
10      3.5 (10)     8.8 (26)     1.8 (5)      13.3 (7)
20      3.5 (5)      8.4 (12)     4.0 (4)      13.3 (5)

Table 5: Percentage of misclassification and the number of support features (in brackets) for the “protein” data set for the P-SVM method. The total number of features is 226. C was 100. The five columns (left → right) show the values of ε and the results for the four classification problems “one class against the rest”, using 10-fold cross-validation.
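The trade-off reported in Table 5 can be reproduced in outline by a simple sweep over the correlation threshold. The sketch below assumes a hypothetical solver psvm_fit(K, y, C, eps) for the P-SVM with the correlation-threshold regularization; no such function is defined in this paper, and the sketch only shows how the error and the number of support features are recorded for each value of ε.

import numpy as np

def epsilon_sweep(K, y, eps_values, psvm_fit, C=100.0):
    # psvm_fit is a hypothetical P-SVM solver; it is assumed to return the
    # dual weight vector alpha (one entry per "row" object / complex feature)
    # and a cross-validated misclassification rate for the given C and eps.
    results = []
    for eps in eps_values:
        alpha, cv_error = psvm_fit(K, y, C=C, eps=eps)
        n_support = int(np.sum(np.abs(alpha) > 1e-8))   # nonzero dual weights
        results.append((eps, n_support, cv_error))
    return results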

Table 6 shows the corresponding results (10-fold cross-validation) for the P-SVM applied to the “world wide web” data set “Cornell” and the classification problem “student pages vs. the rest”. Only ingoing links (matrix K^T of Section 3.1.3) were used. C was optimized using 3-fold cross-validation on the corresponding training sets for each of the 10-fold cross-validation runs. By increasing the regularization parameter ε, the number of web pages which have to be considered in order to classify a new page (the number of support vectors) decreases from 135 to 8. At the same time, the percentage of pages which can no longer be classified, because they receive no ingoing link from one of the “support vector pages”, increases. The percentage of misclassification, however, is reduced from 14 % for ε = 0.1 to 0.6 % for ε = 2.0. With only 8 pages providing ingoing links, more than 50 % of the pages could still be classified, with a misclassification rate of only 0.6 %.

“Cornell” data set, student pages

ε     % cl.   % mis.   # (%) SVs        ε     % cl.   % mis.   # (%) SVs
0.1    84     14       135 (38.6)       0.8    65     3.1      34 (9.7)
0.2    81     12       115 (32.8)       0.9    64     2.7      32 (9.1)
0.3    79      9.7      99 (28.3)       1.0    61     1.4      27 (7.7)
0.4    75      6.9      72 (20.6)       1.1    59     1.0      21 (6.0)
0.5    73      5.5      58 (16.6)       1.4    56     1.0      12 (3.4)
0.6    71      4.8      48 (13.7)       1.6    55     1.0      10 (2.8)
0.7    66      3.9      38 (10.9)       2.0    51     0.6       8 (2.3)

Table 6: P-SVM feature selection and classification results (10-fold cross-validation) for the “world wide web” data set “Cornell” and the classification problem “student pages against the rest”. The columns show (left → right): the value of ε, the percentage of classified pages (“cl.”), the percentage of misclassifications (“mis.”), and the number (and percentage) of support vectors. C was obtained by a 3-fold cross-validation on the corresponding training sets.
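The “% cl.” column of Table 6 can be computed directly from the ingoing-link matrix once the support vector pages are known. The sketch below assumes an encoding in which K_in[i, j] is nonzero whenever page j provides an ingoing link to page i; this encoding is an assumption made for illustration, not a detail given in the text.

import numpy as np

def classifiable_fraction(K_in, support_idx):
    # K_in        : (n_pages, n_pages) ingoing-link matrix, assumed encoding:
    #               K_in[i, j] != 0 iff page j links to page i
    # support_idx : indices of the "support vector pages"
    # A page can only be classified if it receives at least one ingoing link
    # from a support vector page.
    has_link = np.any(K_in[:, support_idx] != 0, axis=1)
    return float(np.mean(has_link))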

3.3.2 DNA Microarray Data Sets

In this subsection we describe the application of the P-SVM to DNA microarray data. The data were taken from Pomeroy et al. (2002) (brain tumor), Shipp et al. (2002) (lymphoma), and van’t Veer et al. (2002) (breast cancer). All three data sets consist of tissue samples from different patients, which were characterized by the expression values of a large number of genes. All samples were labeled according to the outcome of a particular treatment (positive/negative), and the task was to predict the outcome of the treatment for a new patient. We compared the classification performance of the following combinations of a feature selection and a classification method:

      feature selection                        classification
(1)   expression value of the TrkC gene       one-gene classification
(2)   SPLASH (Califano et al., 1999)          likelihood ratio classifier (LRC)
(3)   signal-to-noise statistics (STN)        K-nearest neighbor (KNN)
(4)   signal-to-noise statistics (STN)        weighted voting (voting)
(5)   Fisher statistics (Fisher)              weighted voting (voting)
(6)   R2W2                                    R2W2
(7)   P-SVM                                   ν-SVM

The P-SVM/ν-SVM results are taken from (Hochreiter and Obermayer, 2004b), where the details concerning the data sets and the gene selection procedure based on the P-SVM can be found; the results for the other combinations were taken from the references cited above. All results are summarized in Table 7. The comparison shows that the P-SVM method outperforms the standard methods.

Brain Tumor
Feature Selection / Classification    # F    # E
TrkC gene                               1     33
SPLASH / LRC                            –     25
R2W2                                    –     25
STN / voting                            –     23
STN / KNN                               8     22
TrkC & SVM & KNN                        –     20
P-SVM / ν-SVM                          45      7

Lymphoma
Feature Selection / Classification    # F    # E
STN / KNN                               8     28
STN / voting                           13     24
R2W2                                    –     22
P-SVM / ν-SVM                          18     21

Breast Cancer
Feature Selection / Classification    # F    # E    ROC area
Fisher / voting                        70     26    0.88
P-SVM / ν-SVM                          30     15    0.77

Table 7: Results for the three DNA microarray data sets “brain tumor”, “lymphoma”, and “breast cancer” (see text). The table shows the leave-one-out classification error E (% misclassifications) and the number F of selected features. For “breast cancer” only the minimal value of E for different thresholds was available; therefore, the area under a receiver operating characteristic curve is provided in addition. Best results are shown in bold.
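One way to compute such a leave-one-out error for a feature-selection/classifier combination is sketched below; select_features and fit_classifier are hypothetical placeholders for one of the combinations listed above (no such functions are provided in this paper), and the sketch is not necessarily the exact protocol of the cited studies.

import numpy as np

def loo_error(X, y, select_features, fit_classifier):
    # X: (n_samples, n_genes) expression matrix, y: treatment outcomes.
    # Feature selection is redone inside every fold so that the held-out
    # sample does not influence the selected genes.
    n = len(y)
    errors = []
    for i in range(n):
        train = np.setdiff1d(np.arange(n), [i])
        feats = select_features(X[train], y[train])        # e.g. gene indices
        model = fit_classifier(X[train][:, feats], y[train])
        y_hat = model.predict(X[[i]][:, feats])
        errors.append(y_hat[0] != y[i])
    return 100.0 * np.mean(errors)                         # error E in percent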

4 Summary

In this contribution we have described the Potential Support Vector Machine (P-SVM) as a new method for classification, regression, and feature selection. The P-SVM selects models using the principle of structural risk minimization. In contrast to standard SVM approaches, however, the P-SVM is based on a new objective function and a new set of constraints, which lead to an expansion of the classification or regression function in terms of “support features”. The optimization problem is quadratic, always well defined, suited for dyadic data, and requires neither square nor positive definite Gram matrices. Therefore, the method can also be used without preprocessing with matrices which are measured and with matrices which are constructed from a vectorial representation using an indefinite kernel function. In feature selection mode the P-SVM allows one to select and rank the features through the support vector weights of its sparse set of support vectors. The sparseness constraint avoids sets of features which are redundant. In a classification or regression setting this is an advantage over statistical methods, where redundant features are often kept as long as they provide information about the objects’ attributes. Because the dual formulation of the optimization problem can be solved by a fast SMO technique, the new P-SVM can be applied to data sets with many features. The P-SVM approach was compared with several state-of-the-art classification, regression, and feature selection methods. Whenever significance tests could be applied, the P-SVM never performed significantly worse than its “competitors”, and in many cases it performed significantly better. Even when no significant improvement in prediction error could be found, the P-SVM needed fewer “support features”, i.e. fewer measurements, for evaluating new data objects. Finally, we have suggested a new interpretation of dyadic data, where real-world objects are not described by vectors. Structures like dot products are induced directly through measurements on object pairs, i.e. through relations between objects. This opens up a new field of research where relations between real-world objects determine mathematical structures.

Acknowledgments

We thank M. Albery-Speyer, C. Buscher, C. Minoux, R. Sanyal, and S. Seo for their help with the numerical simulations. This work was funded by the Anna-Geissler-Stiftung and the Monika-Kuntzner-Stiftung.

A Measurements, Kernels, and Dot Products

In this appendix we address the question under which conditions a “measurement kernel”, which gives rise to a measured matrix K, can be interpreted as a dot product between feature vectors describing the “row” and “column” objects of a “dyadic data” set. Assume that the “column” objects x (“samples”) and the “row” objects z (“complex features”) are from sets X and Z, which can both be completed by a σ-algebra and a measure to measurable spaces. Let (U, B, μ) be a measurable space with σ-algebra B and a σ-additive measure μ on the set U. We consider functions f : U → R on the set U. A function f is called μ-measurable on (U, B) if f^{-1}([a, b]) ∈ B for all a, b ∈ R, and μ-integrable if ∫_U |f| dμ < ∞. We define

    ‖f‖_{L^2} := ( ∫_U f^2 dμ )^{1/2}                                             (36)

and the set

    L^2(U) := { f : U → R ; f is μ-measurable and ‖f‖_{L^2} < ∞ }.                (37)

L^2(U) is a Banach space with norm ‖·‖_{L^2}. If we define the dot product

    ⟨f, g⟩_{L^2(U)} := ∫_U f g dμ ,                                               (38)

then the Banach space L^2(U) becomes a Hilbert space with dot product ⟨·,·⟩_{L^2(U)} over the scalar field R. For simplicity, we denote this Hilbert space by L^2(U). L^2(U_1, U_2) is the Hilbert space of functions k with ∫∫ k^2(u_1, u_2) dμ_2(u_2) dμ_1(u_1) < ∞, using the product measure on U_1 × U_2. With these definitions we see that H_1 := L^2(Z), H_2 := L^2(X), and H_3 := L^2(X, Z) are Hilbert spaces of L^2-functions with domains Z, X, and X × Z, respectively. The dot product in H_i is denoted by ⟨·,·⟩_{H_i}. Note that for discrete X or Z the respective integrals can be replaced by sums (using a measure composed of Dirac delta functions at the discrete points, see Werner, 2000, p. 464, ex. (c)).
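As a concrete illustration of the discrete case just mentioned (added here for clarity, under the assumption of a counting measure built from Dirac deltas), take a finite set U = {u_1, ..., u_N} and μ = Σ_{i=1}^{N} δ_{u_i}. The integrals above then reduce to sums,

    ⟨f, g⟩_{L^2(U)} = ∫_U f g dμ = Σ_{i=1}^{N} f(u_i) g(u_i) ,      ‖f‖_{L^2}^2 = Σ_{i=1}^{N} f(u_i)^2 ,

so L^2(U) can be identified with R^N equipped with the standard dot product; this is the setting in which a measured matrix K lives.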

Let us now assume that k ∈ H_3. Then k induces a Hilbert-Schmidt operator T_k:

    f(x) = (T_k α)(x) = ∫_Z k(x, z) α(z) dμ(z) ,                                  (39)

which maps α ∈ H_1 (a parameterization) to f ∈ H_2 (a classifier). If we set α(z) = Σ_{j=1}^{P} α_j δ_{z^j}(z), we recover the P-SVM classification function (without b), eq. (27),

    f(u) = Σ_{j=1}^{P} α_j k(u, z^j) = Σ_{j=1}^{P} α_j K(u)_j ,                   (40)

where the coefficients α_j are the weights of the Dirac delta functions δ_{z^j} located at the points z^j. The next theorem provides the conditions a kernel must fulfill in order to be interpretable as a dot product between the objects’ feature vectors.

Theorem 1 (Singular Value Expansion) Let α be from H_1 and let k be a kernel from H_3 which defines a Hilbert-Schmidt operator T_k : H_1 → H_2,

    (T_k α)(x) = f(x) = ∫_Z k(x, z) α(z) dz .                                     (41)

Then

    ‖f‖_{H_2}^2 = ⟨T_k^* T_k α, α⟩_{H_1} ,                                        (42)

where T_k^* is the adjoint operator of T_k, and there exists an expansion

    k(x, z) = Σ_n s_n e_n(z) g_n(x)                                               (43)

which converges in the L^2-sense. The s_n ≥ 0 are the singular values of T_k, and e_n ∈ H_1, g_n ∈ H_2 are the corresponding orthonormal functions. For X = Z and a symmetric, positive definite kernel k, we obtain the eigenfunctions e_n = g_n of T_k with corresponding eigenvalues s_n.
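A finite-dimensional analogue of Theorem 1 (an illustrative sketch added here, not part of the original derivation) can be read off the singular value decomposition of a measured, possibly rectangular and indefinite matrix K with entries K[i, j] = k(x^i, z^j):

import numpy as np

rng = np.random.default_rng(0)
K = rng.normal(size=(6, 4))     # 6 "column" objects x^i, 4 "row" objects z^j

# SVD: K = G diag(s) E^T, i.e. k(x^i, z^j) = sum_n s_n e_n(z^j) g_n(x^i),
# the finite analogue of the expansion in eq. (43).
G, s, Et = np.linalg.svd(K, full_matrices=False)
assert np.allclose(K, G @ np.diag(s) @ Et)

# Feature maps in the spirit of eq. (52) below: phi(x^i) = (s_n g_n(x^i))_n
# and omega(z^j) = (e_n(z^j))_n, so that <phi(x^i), omega(z^j)> = k(x^i, z^j).
Phi = G * s        # row i is phi(x^i)
Omega = Et.T       # row j is omega(z^j)
assert np.allclose(Phi @ Omega.T, K)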

Proof. From f = T_k α we obtain

    ‖f‖_{H_2}^2 = ⟨T_k α, T_k α⟩_{H_2} = ⟨T_k^* T_k α, α⟩_{H_1} .                 (44)

The singular value expansion of T_k is

    T_k α = Σ_n s_n ⟨α, e_n⟩_{H_1} g_n                                            (45)

(see Werner, 2000, Theorem VI.3.6). The values s_n are the singular values of T_k for the orthonormal systems {e_n} on H_1 and {g_n} on H_2. We define

    r_{nm} := ⟨T_k e_n, g_m⟩_{H_2} = δ_{nm} s_m ,                                 (46)

where the last “=” results from eq. (45) for α = e_n. The sum

    Σ_m r_{nm}^2 = Σ_m (⟨T_k e_n, g_m⟩_{H_2})^2 ≤ ‖T_k e_n‖_{H_2}^2 < ∞           (47)

converges because of Bessel’s inequality (the “≤” sign). Next we complete the orthonormal system (ONS) {e_n} to an orthonormal basis (ONB) {ẽ_l} by adding an ONB of the kernel ker(T_k) of the operator T_k to the ONS {e_n}. The function α ∈ H_1 possesses a unique representation in this basis: α = Σ_l ⟨α, ẽ_l⟩_{H_1} ẽ_l. We obtain

    T_k α = Σ_l ⟨α, ẽ_l⟩_{H_1} T_k ẽ_l ,                                          (48)

where we used that T_k is continuous. Because T_k ẽ_l = 0 for all ẽ_l ∈ ker(T_k), the image T_k α can be expressed through the ONS {e_n}:

    T_k α = Σ_n ⟨α, e_n⟩_{H_1} T_k e_n
          = Σ_n ⟨α, e_n⟩_{H_1} ( Σ_m ⟨T_k e_n, g_m⟩_{H_2} g_m )
          = Σ_{n,m} r_{nm} ⟨α, e_n⟩_{H_1} g_m .                                   (49)

Here we used the fact that {g_m} is an ONB of the range of T_k and, therefore, T_k e_n = Σ_m ⟨T_k e_n, g_m⟩_{H_2} g_m. Because the set of functions {e_n g_m} is an ONS in H_3 (which can be completed to an ONB) and Σ_{n,m} r_{nm}^2 < ∞ (cf. eq. (47)), the kernel

    k̃(z, x) := Σ_{n,m} r_{nm} e_n(z) g_m(x)                                      (50)

is an element of H_3. We observe that the induced Hilbert-Schmidt operator T_k̃ is equal to T_k:

    (T_k̃ α)(x) = Σ_{n,m} r_{nm} ⟨α, e_n⟩_{H_1} g_m(x) = (T_k α)(x) ,             (51)

where the first “=” sign follows from eq. (50) and the second “=” sign from eq. (49). Hence, the kernels k and k̃ are equal except on a set of measure zero, i.e. k = k̃. From eq. (46) we obtain ⟨T_k e_l, g_t⟩_{H_2} = δ_{lt} s_l, and from eq. (51) ⟨T_k̃ e_l, g_t⟩_{H_2} = r_{lt}; therefore r_{lt} = δ_{lt} s_l. Inserting r_{nm} = δ_{nm} s_n into eq. (50) proves eq. (43) of the theorem. The last statement of the theorem follows from the fact that |T_k| = (T_k^* T_k)^{1/2} = T_k (T_k is positive and self-adjoint) and, therefore, e_n = g_n (Werner, 2000, proof of Theorem VI.3.6, p. 246 top). ∎

As a consequence of this theorem we can, for finite Z, define a mapping ω of the “row” objects z and a mapping φ of the “column” objects x into a common feature space in which k acts as a dot product:

    φ(x) := ( s_1 g_1(x), s_2 g_2(x), ... ) ,      ω(z) := ( e_1(z), e_2(z), ... ) ,
    ⟨φ(x), ω(z)⟩ = Σ_n s_n e_n(z) g_n(x) = k(x, z) .                              (52)

Note that finite Z ensures that ⟨ω(z), ω(z)⟩ converges. From eq. (49) we obtain for the classification or regression function

    f(x) = Σ_n s_n ⟨α, e_n⟩_{H_1} g_n(x) .                                        (53)

It is well defined because sets of zero measure vanish through the integration in eq. (39), which is confirmed by the expansion eq. (53), where the zero measure is “absorbed” in the terms ⟨α, e_n⟩_{H_1}. To obtain absolute and uniform convergence of the sum for f(x), we must enforce ‖k(x, ·)‖_{H_1}^2 ≤ K^2, as can be seen in the following corollary.

Corollary 1 (Linear Classification in ℓ^2) Let the assumptions of Theorem 1 hold and let ∫_Z (k(x, z))^2 dz ≤ K^2 for all x ∈ X. We define w := ( ⟨α, e_1⟩_{H_1}, ⟨α, e_2⟩_{H_1}, ... ) and φ(x) := ( s_1 g_1(x), s_2 g_2(x), ... ). Then w, φ(x) ∈ ℓ^2 with ‖w‖_{ℓ^2}^2 ≤ ‖α‖_{H_1}^2 and ‖φ(x)‖_{ℓ^2}^2 ≤ K^2, and the following sum converges absolutely and uniformly:

    f(x) = ⟨w, φ(x)⟩_{ℓ^2} = Σ_n s_n ⟨α, e_n⟩_{H_1} g_n(x) .                      (54)

Proof. First we show that φ(x) ∈ ℓ^2:

    ‖φ(x)‖_{ℓ^2}^2 = Σ_n ( s_n g_n(x) )^2 = Σ_n ( (T_k e_n)(x) )^2
                   = Σ_n ( ⟨k(x, ·), e_n⟩_{H_1} )^2 ≤ ‖k(x, ·)‖_{H_1}^2
                   ≤ sup_{x ∈ X} { ∫_Z (k(x, z))^2 dz } ≤ K^2 ,                   (55)

where we used Bessel’s inequality for the first “≤”, the supremum over x ∈ X for the second “≤” (the supremum exists because { ∫_Z (k(x, z))^2 dz } is a bounded subset of R), and the assumption of the corollary for the last “≤”. To prove ‖w‖_{ℓ^2}^2 ≤ ‖α‖_{H_1}^2 we again use Bessel’s inequality:

    ‖w‖_{ℓ^2}^2 = Σ_n ( ⟨α, e_n⟩_{H_1} )^2 ≤ ‖α‖_{H_1}^2 .                        (56)

Finally, we prove that the sum

    f(x) = ⟨w, φ(x)⟩_{ℓ^2} = Σ_n s_n ⟨α, e_n⟩_{H_1} g_n(x)                        (57)

converges absolutely and uniformly. The fact that the sum converges in the L^2-sense follows directly from the singular value expansion of Theorem 1. We now choose an m ∈ N with

    Σ_{n=m}^{∞} ( ⟨α, e_n⟩_{H_1} )^2 ≤ ( ε / K )^2                                (58)

for ε > 0 (because of eq. (56) such an m exists), and we apply the Cauchy-Schwarz inequality:

    Σ_{n=m}^{∞} | s_n ⟨α, e_n⟩_{H_1} g_n(x) |
        ≤ ( Σ_{n=m}^{∞} (s_n g_n(x))^2 )^{1/2} ( Σ_{n=m}^{∞} (⟨α, e_n⟩_{H_1})^2 )^{1/2}
        ≤ K · (ε / K) = ε ,

where we used the inequalities eqs. (55) and (58). Because m is independent of x, the convergence is also absolute and uniform. ∎

Eq. (39) or, equivalently, eq. (54) is a linear classification or regression function in ℓ^2. We find that the expansion of the classifier f converges absolutely and uniformly and, therefore, that f is continuous. This can be seen because the e_n are eigenfunctions of the compact, positive, self-adjoint operator (T_k^* T_k)^{1/2} and the g_n are isometric images of the e_n (see Werner, 2000, Theorem VI.3.6 and the text before Theorem VI.4.2). Hence, the orthonormal functions g_n are continuous. This also justifies the analysis in eq. (28) and the derivatives of g_m with respect to x.

Relation of the theoretical considerations to the P-SVM

Using μ(x) = Σ_{i=1}^{L} δ_{x^i}(x) and α(z) = Σ_{j=1}^{P} α_j δ_{z^j}(z), we obtain

    f(x) = Σ_{j=1}^{P} α_j k(x, z^j) = ⟨ φ(x), Σ_{j=1}^{P} α_j ω(z^j) ⟩ ,
    X_φ  = ( φ(x^1), φ(x^2), ..., φ(x^L) ) ,
    Z_ω  = ( ω(z^1), ω(z^2), ..., ω(z^P) ) ,
    w    = Σ_{j=1}^{P} α_j ω(z^j)            (expansion into support vectors),
    K_ij = ⟨ φ(x^i), ω(z^j) ⟩ = Σ_n s_n e_n(z^j) g_n(x^i) = k(x^i, z^j) ,
    K    = X_φ^T Z_ω ,   and
    ‖f‖_{H_2}^2 = α^T K^T K α = ‖ X_φ^T w ‖_2^2          (the objective function). (59)

Note that in the P-SVM formulation w is not unique with respect to the zero-subspace of the matrix X. Here we obtain the related result that w is not unique with respect to the subspace which is mapped to the zero function by T_k. Interestingly, we have recovered the new objective function, eq. (4), as the squared L^2-norm ‖f‖_{H_2}^2 of the classification function. This, again, motivates the use of the new objective function as a capacity measure. We also find that the primal problem of the P-SVM (eq. (21)) corresponds to the formulation in H_2, while the dual (eq. (26)) corresponds to the formulation in H_1. Primal and dual P-SVM formulations can be transformed into each other via the property ⟨T_k α, T_k α⟩_{H_2} = ⟨T_k^* T_k α, α⟩_{H_1}.
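For completeness, a short derivation is added here to make the last identity of eq. (59) explicit; it only uses the discrete measure μ introduced above:

    ‖f‖_{H_2}^2 = ∫_X f^2(x) dμ(x) = Σ_{i=1}^{L} f(x^i)^2
                = Σ_{i=1}^{L} ( Σ_{j=1}^{P} α_j K_ij )^2
                = ‖ K α ‖_2^2 = α^T K^T K α = ‖ X_φ^T w ‖_2^2 ,

where the final step uses K α = X_φ^T Z_ω α = X_φ^T w.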

References

P. Ahlgren, B. Jarneving, and R. Rousseau. Requirements for a cocitation similarity measure, with special reference to Pearson’s correlation coefficient. Journal of the American Society for Information Science and Technology, 54:550–560, 2003.

W. Bains and G. Smith. A novel method for nucleic acid sequence determination. Journal of Theoretical Biology, 135:303–307, 1988.

A. E. Bayer, J. C. Smart, and G. W. McLaughlin. Mapping intellectual structure of a scientific subfield through author cocitations. Journal of the American Society for Information Science, 41(6):444–452, 1990.

Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B, 57(1):289–300, 1995.

Y. Benjamini and D. Yekutieli. The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics, 29(4):1165–1188, 2001.

A. Blum and P. Langley. Selection of relevant features and examples in machine learning. Artificial Intelligence, 97:245–271, 1997.

A. Califano, G. Stolovitzky, and Y. Tu. Analysis of gene expression microarrays for phenotype classification. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, pages 75–85, 1999.

W. Chu, S. S. Keerthi, and C. J. Ong. Bayesian support vector regression using a unified loss function. IEEE Transactions on Neural Networks, 2004. To appear.

T. Cremer, A. Kurz, R. Zirbel, S. Dietzel, B. Rinke, E. Schrock, M. R. Speichel, U. Mathieu, A. Jauch, P. Emmerich, H. Schertan, T. Ried, C. Cremer, and P. Lichter. Role of chromosome territories in the functional compartmentalization of the cell nucleus. Cold Spring Harbor Symp. Quant. Biol., 58:777–792, 1993.

T. G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10:1895–1923, 1998.

R. Drmanac, I. Labat, I. Brukner, and R. Crkvenjakov. Sequencing of megabase plus DNA by hybridization: theory of the method. Genomics, 4:114–128, 1989.

L. Falquet, M. Pagni, P. Bucher, N. Hulo, C. J. Sigrist, K. Hofmann, and A. Bairoch. The PROSITE database, its status in 2002. Nucleic Acids Research, 30:235–238, 2002.

T. Graepel, R. Herbrich, P. Bollmann-Sdorra, and K. Obermayer. Classification on pairwise proximity data. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems, volume 11, pages 438–444. MIT Press, Cambridge, MA, 1999.

I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003. Special Issue on Variable and Feature Selection.

R. Herbrich, T. Graepel, P. Bollmann-Sdorra, and K. Obermayer. Learning a preference relation in IR. In Proceedings of the Workshop on Text Categorization and Machine Learning, International Conference on Machine Learning 1998, pages 80–84, 1998.

L. J. Heyer, S. Kruglyak, and S. Yooseph. Exploring expression data: identification and analysis of coexpressed genes. Genome Research, 11:1106–1115, 1999.

S. Hochreiter, M. C. Mozer, and K. Obermayer. Coulomb classifiers: Generalizing support vector machines via an analogy to electrostatic systems. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 545–552. MIT Press, Cambridge, MA, 2003.

S. Hochreiter and K. Obermayer. Classification, regression, and feature selection on matrix data. Technical Report 2004/2, Technische Universitat Berlin, Fakultat fur Elektrotechnik und Informatik, 2004a.

S. Hochreiter and K. Obermayer. Gene selection for microarray data. In B. Scholkopf, K. Tsuda, and J.-P. Vert, editors, Kernel Methods in Computational Biology, pages 319–355. MIT Press, 2004b.

S. Hochreiter and K. Obermayer. Sphered support vector machine. Technical report, Technische Universitat Berlin, Fakultat fur Elektrotechnik und Informatik, 2004c.

S. Hochreiter and K. Obermayer. Nonlinear feature selection with the potential support vector machine. In I. Guyon, S. Gunn, M. Nikravesh, and L. Zadeh, editors, Feature Extraction: Foundations and Applications. Springer, 2005.

P. D. Hoff. Bilinear mixed-effects models for dyadic data. Journal of the American Statistical Association, 100(469):286–295, 2005.

T. Hofmann and J. M. Buhmann. Pairwise data clustering by deterministic annealing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1):1–25, 1997.

T. Hofmann and J. Puzicha. Unsupervised learning from dyadic data. Technical Report TR-98-042, International Computer Science Institute, Berkeley, CA, 1998.

J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the Association for Computing Machinery, 46(5):604–632, 1999.

R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273–324, 1997.

H. Li and E. Loken. A unified theory of statistical analysis and inference for variance component models for dyadic data. Statistica Sinica, 12:519–535, 2002.

D. Lipman and W. Pearson. Rapid and sensitive protein similarity searches. Science, 227:1435–1441, 1985.

Q. Lu, L. L. Wallrath, and S. C. R. Elgin. Nucleosome positioning and gene regulation. Journal of Cellular Biochemistry, 55:83–92, 1994.

Y. Lysov, V. Florent’ev, A. Khorlin, K. Khrapko, V. Shik, and A. Mirzabekov. DNA sequencing by hybridization with oligonucleotides. Doklady Academy Nauk USSR, 303:1508–1511, 1988.

O. L. Mangasarian. Generalized support vector machines. Technical Report 98-14, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin, 1998.

C. B. Mazza, N. Sukumar, C. M. Breneman, and S. M. Cramer. Prediction of protein retention in ion-exchange systems using molecular descriptors obtained from crystal structure. Anal. Chem., 73:5457–5461, 2001.

S. Mika, G. Ratsch, J. Weston, B. Scholkopf, and K.-R. Muller. Fisher discriminant analysis with kernels. In Y.-H. Hu, J. Larsen, E. Wilson, and S. Douglas, editors, Neural Networks for Signal Processing IX, pages 41–48. IEEE, 1999.

S. L. Pomeroy, P. Tamayo, M. Gaasenbeek, L. M. Sturla, M. Angelo, M. E. McLaughlin, J. Y. H. Kim, L. C. Goumnerova, P. M. Black, C. Lau, J. C. Allen, D. Zagzag, J. M. Olson, T. Curran, C. Wetmore, J. A. Biegel, T. Poggio, S. Mukherjee, R. Rifkin, A. Califano, G. Stolovitzky, D. N. Louis, J. P. Mesirov, E. S. Lander, and T. R. Golub. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415(6870):436–442, 2002.

G. Ratsch, T. Onoda, and K.-R. Muller. Soft margins for AdaBoost. Machine Learning, 42(3):287–320, 2001. Also: NeuroCOLT Technical Report 1998-021.

G. Salton. Automatic Information Organization and Retrieval. McGraw-Hill, New York, 1968.

U. Scherf, D. T. Ross, M. Waltham, L. H. Smith, J. K. Lee, L. Tanabe, K. W. Kohn, W. C. Reinhold, T. G. Myers, D. T. Andrews, D. A. Scudiero, M. B. Eisen, E. A. Sausville, Y. Pommier, D. Botstein, P. O. Brown, and J. N. Weinstein. A gene expression database for the molecular pharmacology of cancer. Nature Genetics, 24(3):236–244, 2000.

B. Scholkopf, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Generalization bounds via eigenvalues of the Gram matrix. Technical Report NC2-TR-1999-035, NeuroCOLT2, 1999.

B. Scholkopf and A. J. Smola. Learning with Kernels – Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, 2002.

J. Shawe-Taylor, P. L. Bartlett, R. Williamson, and M. Anthony. A framework for structural risk minimization. In Proceedings of the 9th Annual Conference on Computational Learning Theory, pages 68–76, New York, 1996. Association for Computing Machinery.

J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940, 1998.

M. A. Shipp, K. N. Ross, P. Tamayo, A. P. Weng, R. C. T. Aguiar, J. L. Kutok, M. Gaasenbeek, M. Angelo, M. Reich, T. S. Ray, G. S. Pinkus, M. A. Koval, K. W. Last, A. Norton, J. Mesirov, T. A. Lister, D. S. Neuberg, E. S. Lander, J. C. Aster, and T. R. Golub. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature Medicine, 8(1):68–74, 2002.

C. J. Sigrist, L. Cerutti, N. Hulo, A. Gattiker, L. Falquet, M. Pagni, A. Bairoch, and P. Bucher. PROSITE: A documented database using patterns and profiles as motif descriptors. Briefings in Bioinformatics, 3:265–274, 2002.

E. Southern. United Kingdom patent application GB8810400, 1988.

L. J. van’t Veer, H. Dai, M. J. van de Vijver, Y. D. He, A. A. Hart, M. Mao, H. L. Peterse, K. van der Kooy, M. J. Marton, A. T. Witteveen, G. J. Schreiber, R. M. Kerkhoven, C. Roberts, P. S. Linsley, R. Bernards, and S. H. Friend. Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415:530–536, 2002.

V. N. Vapnik. Statistical Learning Theory. Adaptive and learning systems for signal processing, communications, and control. Wiley, New York, 1998.

D. Werner. Funktionalanalysis. Springer, Berlin, 3. edition, 2000.

H. D. White and K. W. McCain. Bibliometrics. Annual Review of Information Science and Technology, 24:119–186, 1989.
