Support Vector Machines for Dyadic Data
Sepp Hochreiter and Klaus Obermayer
Department of Electrical Engineering and Computer Science, Technische Universität Berlin, 10587 Berlin, Germany
{hochreit,oby}@cs.tu-berlin.de
Abstract

We describe a new technique for the analysis of dyadic data, where two sets of objects ("row" and "column" objects) are characterized by a matrix of numerical values which describe their mutual relationships. The new technique, called "Potential Support Vector Machine" (P-SVM), is a large-margin method for the construction of classifiers and regression functions for the "column" objects. Contrary to standard support vector machine approaches, the P-SVM minimizes a scale-invariant capacity measure and requires a new set of constraints. As a result, the P-SVM method leads to a usually sparse expansion of the classification and regression functions in terms of the "row" rather than the "column" objects and can handle data and kernel matrices which are neither positive definite nor square. We then describe two complementary regularization schemes. The first scheme improves generalization performance for classification and regression tasks; the second scheme leads to the selection of a small, informative set of "row" "support" objects and can be applied to feature selection. Benchmarks for classification, regression, and feature selection tasks are performed with toy data as well as with several real world data sets. The results show that the new method is at least competitive with, and often performs better than, the benchmarked standard methods for standard vectorial as well as for true dyadic data sets. In addition, a theoretical justification is provided for the new approach.
1 Introduction
Learning from examples in order to predict is one of the standard tasks in machine learning. Many techniques have been developed to solve classification and regression problems, but by far most of them were specifically designed for vectorial data. Vectorial data, where data objects are described by vectors of numbers which are treated as elements of a vector space, are very convenient because of the structure imposed by the Euclidean metric. However, there are
datasets for which a vector-based description is inconvenient or simply wrong, and representations which consider relationships between objects are more appropriate.
A)        a      b      c      d
α       1.3   −2.2   −1.6    7.8
β      −1.8   −1.1    7.2    2.3
χ       1.2    1.9   −2.9   −2.2
δ       3.7    0.8   −0.6    2.5
ε       9.2   −9.4   −8.3    9.2
φ      −7.7    8.6   −9.7   −7.4
γ      −4.8    0.1   −1.2    0.9

B)        a      b      c      d      e      f      g
a       0.9   −0.1   −0.8    0.5    0.2   −0.5   −0.7
b      −0.1    0.9    0.6    0.3   −0.7   −0.6    0.3
c      −0.8    0.6    0.9    0.2   −0.6    0.6    0.5
d       0.5    0.3    0.2    0.9    0.7    0.1    0.3
e       0.2   −0.7   −0.6    0.7    0.9   −0.9   −0.5
f      −0.5   −0.6    0.6    0.1   −0.9    0.9    0.9
g      −0.7    0.3    0.5    0.3   −0.5    0.9    0.9

Figure 1: A) Dyadic data: Column objects {a, b, ...} and row objects {α, β, ...} are represented by a matrix of numerical values which describe their mutual relationships. B) Pairwise data: a special case of dyadic data, where row and column objects coincide.
In the following we will study representations of data objects which are based on matrices. We consider "column" objects, which are the objects to be described, and "row" objects, which are the objects which serve for their description (cf. Fig. 1A). The whole dataset can then be represented using a rectangular matrix whose entries denote the relationships between the corresponding "row" and "column" objects. We will call representations of this form dyadic data (Hofmann and Puzicha, 1998; Li and Loken, 2002; Hoff, 2005). If "row" and "column" objects are from the same set (Fig. 1B), the representation is usually called pairwise data, and the entries of the matrix can often be interpreted as the degree of similarity (or dissimilarity) between objects. Dyadic descriptions are more powerful than vector-based descriptions, as vectorial data can always be brought into dyadic form when required. This is often done for kernel-based classifiers or regression functions (Schölkopf and Smola, 2002; Vapnik, 1998), where a Gram matrix of mutual similarities is calculated before the predictor is learned. A similar procedure can also be used in cases where the "row" and "column" objects are from different sets. If both of them are described by feature vectors, a matrix can be calculated by applying a kernel function to pairs of feature vectors, one vector describing a "row" and the other vector describing a "column" object. One example for this is the drug-gene matrix of Scherf et al. (2000), which was constructed as the product of a measured drug-sample and a measured sample-gene matrix and where the kernel function was a scalar product.
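The construction just described (applying a kernel function to a pair of "row" and "column" feature vectors) can be sketched in a few lines of Python. The RBF kernel and the tiny feature vectors below are hypothetical and serve only to illustrate how a rectangular dyadic matrix arises:

```python
import numpy as np

def dyadic_matrix(col_objects, row_objects, kernel):
    """Relation matrix K with K[i, j] = kernel(column object i, row object j).

    Rows of K are indexed by the "column" objects (the objects to be
    described), columns of K by the "row" objects (the descriptors)."""
    return np.array([[kernel(x, z) for z in row_objects] for x in col_objects])

# Hypothetical feature vectors for 3 "column" and 2 "row" objects.
cols = [np.array([0.0, 1.0]), np.array([1.0, 0.0]), np.array([1.0, 1.0])]
rows = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]

rbf = lambda x, z: np.exp(-0.5 * np.sum((x - z) ** 2))
K = dyadic_matrix(cols, rows, rbf)
print(K.shape)  # (3, 2): rectangular, need not be square or positive definite
```

The resulting matrix has as many rows as there are objects to describe and as many columns as there are descriptors, which is exactly the shape of the data matrices considered in this paper.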
In many cases, however, dyadic descriptions emerge because the matrix entries are measured directly. Pairwise data representations as a special case of dyadic data can be found for datasets where similarities or distances between objects are measured. Examples include sequential or biophysical similarities between proteins (Lipman and Pearson, 1985; Sigrist et al., 2002; Falquet et al., 2002), chromosome location or co-expression similarities between genes (Cremer et al., 1993; Lu et al., 1994; Heyer et al., 1999), co-citations of text documents (White and McCain, 1989; Bayer et al., 1990; Ahlgren et al., 2003), or hyperlinks between web-pages (Kleinberg, 1999). In general, these measured matrices are symmetric but may not be positive definite, and even if they are for the training set, they may not remain positive definite if new examples are included. Genuine dyadic data occur whenever two sets of objects are related. Examples are DNA microarray data (Southern, 1988; Lysov et al., 1988; Drmanac et al., 1989; Bains and Smith, 1988), where the "column" objects are tissue samples, the "row" objects are genes, and every sample-gene pair is related by the expression level of this particular gene in this particular sample. Other examples are web-documents, where both the "column" and "row" objects are web-pages and "column" web-pages are described by either hyperlinks to or from the "row" objects, which give rise to a rectangular matrix¹. Further examples include documents in a database described by word-frequencies (Salton, 1968) or molecules described by transferable atom equivalent (TAE) descriptors (Mazza et al., 2001).
Traditionally, "row" objects have often been called "features" and "column" vectors of the dyadic data matrix have mostly been treated as "feature vectors" which live in a Euclidean vector space, even when the dyadic nature of the data was made explicit, see (Graepel et al., 1999; Mangasarian, 1998) or the "feature map" method (Schölkopf and Smola, 2002). Difficulties, however, arise when features are heterogeneous, and "apples and oranges" must be compared. What theoretical arguments would, for example, justify treating the values of a set of TAE descriptors as coordinates of a vector in a Euclidean space? A "non-vectorial" approach to pairwise data is to interpret the data matrix as a Gram matrix and to apply support vector machines (SVM) for classification and regression if the data matrix is positive semidefinite (Graepel et al., 1999). For indefinite (but symmetric) matrices two other non-vectorial approaches have been suggested (Graepel et al., 1999). In the first approach, the data matrix is projected into the subspace spanned by the eigenvectors with positive eigenvalues. This is an approximation, and one would expect good results only if the absolute values of the negative eigenvalues are small compared to the dominant positive ones. In the second approach, directions of negative eigenvalues are processed by flipping the sign of these eigenvalues. All three approaches lead to positive semidefinite matrices on the training set, but positive semidefiniteness is not assured if a new test object must be included. An embedding approach was suggested by Herbrich et al. (1998) for antisymmetric matrices, but this was
¹ Note that for the pairwise data examples the linking matrix was symmetric, because links were considered bidirectional.
specifically designed for data sets where the matrix entries denote preference relations between objects. In summary, no general and principled method exists to learn classifiers or regression functions for dyadic data. In order to avoid the abovementioned shortcomings, we suggest to consider "column" and "row" objects on an equal footing and to interpret the matrix entries as the result of a kernel function or measurement kernel, which takes a "row" object, applies it to a "column" object, and outputs a number. It will turn out that mild conditions on this kernel suffice to create a vector space endowed with a dot product into which the "row" and the "column" objects can be mapped. Using this mathematical argument as a justification, we show how to construct classification and regression functions in this vector space in analogy to the large margin based methods for learning perceptrons for vectorial data. Using an improved measure for model complexity and a new set of constraints which ensure a good performance on the training data, we arrive at a generally applicable method to learn predictors for dyadic data. The new method is called the "potential support vector machine" (P-SVM). It can handle rectangular matrices as well as pairwise data whose matrices are not necessarily positive semidefinite, but even when the P-SVM is applied to regular Gram matrices, it shows very good results when compared with standard methods. Due to the choice of constraints, the final predictor is expanded into a usually sparse set of descriptive "row" objects, which is different from the standard SVM expansion in terms of the "column" objects. This opens up another important application domain: a sparse expansion is equivalent to feature selection (Guyon and Elisseeff, 2003; Hochreiter and Obermayer, 2004b; Kohavi and John, 1997; Blum and Langley, 1997).
An efficient implementation of the P-SVM using a modified sequential minimal optimization procedure for learning is described in (Hochreiter and Obermayer, 2004a).
2 The Potential Support Vector Machine

2.1 A Scale-Invariant Objective Function

Consider a set {x_φ^i | 1 ≤ i ≤ L} of objects which are described by feature vectors x_φ^i ∈ R^N and which form a training set X_φ = (x_φ^1, ..., x_φ^L). The index "φ" is introduced because we will later use the kernel trick and assume that the vectors x_φ^i are images of a map φ which is induced either by a kernel or by a measurement function. Assume for the moment a simple binary classification problem, where class membership is indicated by labels y_i ∈ {+1, −1}, ⟨·, ·⟩ denotes the dot product, and a linear classifier parameterized by the weight vector w and the offset b is chosen from the set
    { sign(f(x_φ)) | f(x_φ) = ⟨w, x_φ⟩ + b } .   (1)
Standard SVM techniques select the classifier with the largest margin under the constraint of correct classification on the training set:

    min_{w,b}  (1/2) ||w||²                          (2)
    s.t.  y_i (⟨w, x_φ^i⟩ + b) ≥ 1 .

If the training data are not linearly separable, a large margin is traded against a small training error using a suitable regularization scheme. The maximum margin objective is motivated by bounds on the generalization error using the Vapnik-Chervonenkis (VC) dimension h as capacity measure (Vapnik, 1998). For the set of all linear classifiers defined on X_φ for which ρ ≥ ρ_min holds, one obtains h ≤ min{[R²/ρ_min²], N} + 1 (see Vapnik, 1998; Schölkopf and Smola, 2002). [·] denotes the integer part, ρ is the margin, and R is the radius of the smallest sphere in data space which contains all the training data. Bounds derived using the fat shattering dimension (Shawe-Taylor et al., 1996, 1998; Schölkopf and Smola, 2002) and bounds on the expected generalization error (cf. Vapnik, 1998; Schölkopf and Smola, 2002) depend on R/ρ_min in a similar manner. Both the selection of a classifier using the maximum margin principle and the values obtained for the generalization error bounds suffer from the problem that
they are not invariant under linear transformations. This problem is illustrated
Figure 2: LEFT: data points from two classes (triangles and circles) are separated by the hyperplane with the largest margin (solid line). The two support vectors (black symbols) are separated by d_x along the horizontal and by d_y along the vertical axis, 4ρ² = d_x² + d_y² and R²/ρ² = 4R²/(d_x² + d_y²). The dashed line indicates the classification boundary of the classifier shown on the right, scaled along the vertical axis by the factor 1/s. RIGHT: the same data, but scaled along the vertical axis by the factor s. The solid line denotes the maximum margin hyperplane, 4ρ² = d_x² + s²d_y² and R²/ρ² = 4R²/(d_x² + s²d_y²). For d_y ≠ 0 both the margin ρ and the term R²/ρ² depend on s.
in Fig. 2 for a 2D classification problem. In the left figure, both classes are separated by the hyperplane with the largest margin (solid line). In the right figure, the same classification problem is shown, but scaled along the vertical axis by a factor s. Again, the solid line denotes the support vector solution, but when the classifier is scaled back to s = 1 (dashed line in the left figure) it no longer coincides with the original SVM solution. The ratio R²/ρ², which bounds the VC dimension, also depends on the scale factor s (see legend of Fig. 2). The situation depicted in Fig. 2 occurs whenever the data can be enclosed by a (non-spherical) ellipsoid (cf. Schölkopf et al., 1999). The considerations of Fig. 2 show that the optimal hyperplane is not scale invariant and that predictions of class labels may change if the data is rescaled before learning. Thus the question arises which scale factors should be used for classifier selection. Here we suggest to scale the training data such that the margin remains constant while the radius R of the sphere containing all training data becomes as small as possible. The result is a new sphere with radius R̃ which leads to a tighter margin-based bound for the generalization error. Optimality is achieved when all directions orthogonal to the normal vector w of the hyperplane with maximal margin are scaled to zero and

    R̃ = min_t max_i |⟨ŵ, x_φ^i⟩ + t| ≤ max_i |⟨ŵ, x_φ^i⟩| ,   where ŵ := w/||w|| .

If |t| is small compared to max_i |⟨ŵ, x_φ^i⟩|, e.g. if the data is centered at the origin, t can be neglected through the above inequality. Unfortunately, the above formulation does not lead to an optimization problem which is easy to implement. Therefore, we suggest to minimize the upper bound

    R̃²/ρ² = R̃² ||w||² ≤ max_i ⟨w, x_φ^i⟩² ≤ Σ_i ⟨w, x_φ^i⟩² = ||X_φ^⊤ w||² ,   (3)

where the matrix X_φ := (x_φ^1, x_φ^2, ..., x_φ^L) contains the training vectors x_φ^i.
In the second inequality the maximum norm is bounded by the Euclidean norm. Its worst case factor is L, but the bound is tight. The new objective is well defined also for cases where X_φ^⊤ X_φ and/or X_φ X_φ^⊤ is singular, and the kernel trick carries over to the new technique. It can be shown that replacing the objective function ||w||² (eqs. (2)) by the upper bound
    w^⊤ X_φ X_φ^⊤ w = ||X_φ^⊤ w||²   (4)
on R̃²/ρ², eq. (3), corresponds to the integration of sphering (whitening) and SVM learning if the data have zero mean. If the data have already been sphered, then the covariance matrix is given by X_φ X_φ^⊤ = I and we recover the classical SVM. If not, minimizing the new objective leads to normal vectors which are rotated towards directions of low variance of the data when compared with the standard maximum margin solution. Note, however, that whitening can easily be performed in input space but becomes nontrivial if the data is mapped to a high-dimensional feature space
using a kernel function. In order to test whether the situation depicted in Fig. 2 actually occurs in practice, we applied a C-SVM with an RBF kernel to the UCI data set "breast-cancer" from Subsection 3.1.1. The kernel width σ was chosen small and C large enough, so that no error occurred on the training set (to be consistent with Fig. 2). Sphering was performed in feature space by replacing the objective function in eq. (2) with the new objective, eq. (4). Fig. 3 shows the angle
    ψ = arccos( ⟨w_svm, w_sph⟩ / (√⟨w_svm, w_svm⟩ √⟨w_sph, w_sph⟩) )   (5)
      = arccos( Σ_{i,j} α_i^svm α_j^sph y_i y_j k(x^i, x^j)
                / (√(Σ_{i,j} α_i^svm α_j^svm y_i y_j k(x^i, x^j)) √(Σ_{i,j} α_i^sph α_j^sph y_i y_j k(x^i, x^j))) )

between the hyperplanes for the sphered ("sph") and the non-sphered ("svm") case as a function of the width σ of the RBF kernel. The observed values range between 6° and 50°, providing evidence for the situation shown in Fig. 2.

Figure 3: Application of a C-SVM to the data set "breast-cancer" of the UCI repository. The angle ψ (eq. (5)) between the weight vectors w_svm and w_sph is plotted as a function of σ (C = 1000). For σ → 0 the data is already approximately sphered in feature space, hence the objective functions from eqs. (2) and (4) lead to similar results.
The new objective function, eq. (4), leads to separating hyperplanes which are invariant under linear transformations of the data. As a consequence, neither the bounds nor the performance of the derived classifier depends on how the training data was scaled. But is the new objective function also related to a capacity measure for the model class, as the margin is? It is: in (Hochreiter and Obermayer, 2004c) it has been shown that the capacity measure, eq. (4), emerges when a bound for the generalization error is constructed using the technique of covering numbers.
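The claimed equivalence between the new objective and sphering can be checked numerically. The following sketch, on randomly generated anisotropic toy data, verifies that w^⊤ X_φ X_φ^⊤ w on the raw data equals the classical SVM objective ||v||² on sphered data under the substitution v = C^{1/2} w, where C = X_φ X_φ^⊤:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 50))   # columns are training vectors x^i
X[1] *= 5.0                    # anisotropic: large variance along the second axis

C = X @ X.T                    # (unnormalized) covariance, assumed invertible here
eigval, eigvec = np.linalg.eigh(C)
C_inv_sqrt = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
C_sqrt = eigvec @ np.diag(eigval ** 0.5) @ eigvec.T

X_sph = C_inv_sqrt @ X         # sphered data: X_sph @ X_sph.T == identity

# The new objective on raw data equals the classical ||v||^2 on sphered data
w = rng.normal(size=2)         # an arbitrary weight vector
v = C_sqrt @ w                 # substitution w = C^{-1/2} v
new_objective = w @ C @ w      # ||X^T w||^2 on the raw data
classical = v @ v              # ||v||^2 for the sphered problem
print(np.isclose(new_objective, classical))  # True
```

This is only a numerical illustration of the algebraic identity; it says nothing about the constraints, which must be transformed consistently as well.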
2.2 Constraints for Complex Features

The next step is to formulate a set of constraints which enforce a good performance on the training set and which implement the new idea that "row" and "column" objects are both mapped into a Hilbert space within which matrix entries correspond to scalar products and the classification or regression function is constructed.
Assume that each matrix entry was measured by a device which determines the projection of an object feature vector x_φ onto a direction z_ω in feature space. The value K_ij of such a complex feature z_ω^j for an object x_φ^i is then given by the dot product
    K_ij = ⟨x_φ^i, z_ω^j⟩ .   (6)
In analogy to the index "φ" for x_φ, the index "ω" indicates that we will later assume that the vectors z_ω are images in R^N of a map ω which is induced by either a kernel or a measurement function. A mathematical foundation of the ansatz eq. (6) is given in Appendix A. Assume that we have P devices whose corresponding vectors z_ω^j are the only measurable and accessible directions, and let Z_ω := (z_ω^1, z_ω^2, ..., z_ω^P) be the matrix of all complex features. Then we can summarize our (incomplete) knowledge about the objects in X_φ using the data matrix
    K = X_φ^⊤ Z_ω .   (7)

For DNA microarray data, for example, we could identify K with the matrix of expression values obtained by a microarray experiment. For web data we could identify K with the matrix of ingoing or outgoing hyperlinks. For a document data set we could identify K with the matrix of word frequencies. Hence we assume that x_φ and z_ω live in a space of hidden causes which are responsible for the different attributes of the objects. The complex features z_ω^j span a subspace of the original feature space, but we do not require them to be orthogonal, normalized, or linearly independent. If we set z_ω^j = e^j (the jth Cartesian unit vector), that is Z_ω = I, K_ij = (x_φ^i)_j and P = N, the "new" description, eq. (7), is fully equivalent to the "old" description using the original feature vectors x_φ. We now define the quality measure for the performance of the classifier or the regression function on the training set. We consider the quadratic loss function

    c(y_i, f(x_φ^i)) = (1/2) r_i² ,   r_i = f(x_φ^i) − y_i = ⟨w, x_φ^i⟩ + b − y_i .   (8)

The mean squared error is
    R_emp[f_{w,b}] = (1/L) Σ_{i=1}^{L} c(y_i, f(x_φ^i)) .   (9)

We now require that the selected classification or regression function minimizes R_emp, i.e. that
    ∇_w R_emp[f_{w,b}] = (1/L) X_φ (X_φ^⊤ w + b 1 − y) = 0   (10)

and

    ∂R_emp[f]/∂b = (1/L) Σ_i r_i = b + (1/L) Σ_i (⟨w, x_φ^i⟩ − y_i) = 0 ,   (11)

where the labels for all objects in the training set are summarized by a label vector y. Since the quadratic loss function is convex in w and b, only one minimum exists if X_φ X_φ^⊤ has full rank. If X_φ X_φ^⊤ is singular, then all points of minimal value correspond to a subspace of R^N. From eq. (11) we obtain
    b = −(1/L) Σ_{i=1}^{L} (⟨w, x_φ^i⟩ − y_i) = −(1/L) (w^⊤ X_φ − y^⊤) 1 .   (12)

Condition eq. (10) implies that the directional derivative should be zero along any direction in feature space, including the directions of the complex feature vectors z_ω. We therefore obtain
    dR_emp[f_{w + t z_ω^j, b}]/dt = (z_ω^j)^⊤ ∇_w R_emp[f_{w,b}]   (13)
                                  = (1/L) (z_ω^j)^⊤ X_φ (X_φ^⊤ w + b 1 − y) = 0 ,

and, combining all complex features,

    (1/L) Z_ω^⊤ X_φ (X_φ^⊤ w + b 1 − y) = (1/L) K^⊤ r = σ = 0 .   (14)

Hence we require that for every complex feature z_ω^j the mixed moments σ_j between the residual error r_i and the measured values K_ij should be zero.
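The constraints of eq. (14) hold automatically at any minimum of the quadratic loss, because there X_φ (X_φ^⊤ w + b 1 − y) = 0 and K = X_φ^⊤ Z_ω. A small numerical check with random toy matrices (all sizes and values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
N, L, P = 5, 20, 3
X = rng.normal(size=(N, L))      # columns: feature vectors x^i
Z = rng.normal(size=(N, P))      # columns: complex features z^j
K = X.T @ Z                      # data matrix, eq. (7)
y = rng.choice([-1.0, 1.0], size=L)

# Minimize R_emp = (1/2L) ||X^T w + b 1 - y||^2 over (w, b) by least squares
A = np.hstack([X.T, np.ones((L, 1))])
sol, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = sol[:N], sol[-1]

r = X.T @ w + b - y              # residuals
sigma = K.T @ r / L              # mixed moments, eq. (14)
print(np.allclose(sigma, 0.0))   # True: the constraints hold at the minimum
```

The normal equations of the least-squares fit state exactly that X_φ r = 0 and 1^⊤ r = 0, from which K^⊤ r = Z_ω^⊤ X_φ r = 0 follows for any choice of complex features.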
2.3 The Potential Support Vector Machine (P-SVM)

2.3.1 The Basic P-SVM

The new objective from eq. (4) and the new constraints from eq. (14) constitute a new procedure for selecting a classifier or a regression function. The number of constraints is equal to the number P of complex features, which can be larger or smaller than the number L of data points or the dimension N of the original feature space. Because the mean squared error of a linear function f_{w,b} is a convex function of the parameters w and b, the constraints enforce that R_emp is minimal². If K has at least rank L (the number of training examples), then the minimum even corresponds to r = 0 (cf. eq. (14)). Consequently, overfitting occurs and a regularization scheme is needed. Before a regularization scheme can be defined, however, the mixed moments σ_j must be normalized. The reason is that high values of σ_j may either be the result of a high variance of the values of K_ij or the result of a high correlation between the residual error r_i and the values of K_ij. Since we are interested in the latter, the most appropriate measure would be Pearson's correlation coefficient

    σ̂_j = Σ_{i=1}^{L} (K_ij − K̄_j)(r_i − r̄) / ( √(Σ_{i=1}^{L} (K_ij − K̄_j)²) √(Σ_{i=1}^{L} (r_i − r̄)²) ) ,   (15)

² Note that w = ("X_φ^⊤")† (y − b 1) fulfills the constraints, where ("X_φ^⊤")† is the pseudo-inverse of X_φ^⊤.
where r̄ = (1/L) Σ_{i=1}^{L} r_i and K̄_j = (1/L) Σ_{i=1}^{L} K_ij. If every column vector (K_1j, K_2j, ..., K_Lj)^⊤ of the data matrix K is normalized to mean zero and variance one, we obtain
    σ_j = (1/L) Σ_{i=1}^{L} K_ij r_i = σ̂_j (1/√L) ||r − r̄ 1||₂ .   (16)

Because Σ_i r_i = 0 (eq. (11)), the mixed moments σ_j are now proportional to the correlation coefficient σ̂_j with a proportionality constant which is independent of the complex feature z_ω^j, and σ_j can be used instead of σ̂_j to formulate the constraints. If the data vectors are normalized, the term K^⊤ 1 vanishes and we obtain the basic P-SVM optimization problem
Basic P-SVM optimization problem

    min_w  (1/2) ||X_φ^⊤ w||²                     (17)
    s.t.   K^⊤ (X_φ^⊤ w − y) = 0 .

The offset b of the classification or regression function is given by eq. (12), which simplifies after normalization to (see Hochreiter and Obermayer, 2004a)
    b = (1/L) Σ_{i=1}^{L} y_i .   (18)

We will call this model selection procedure the Potential Support Vector Machine (P-SVM), and we will always assume normalized data vectors in the following.
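The proportionality of eq. (16) can be verified numerically: after normalizing the column vectors of K to mean zero and variance one, the mixed moments differ from Pearson's correlation coefficients only by the factor ||r − r̄ 1||₂ / √L. A sketch with random toy values:

```python
import numpy as np

rng = np.random.default_rng(2)
L, P = 30, 4
K = rng.normal(size=(L, P))
K = (K - K.mean(axis=0)) / K.std(axis=0)   # columns: mean zero, variance one

r = rng.normal(size=L)
r = r - r.mean()                           # sum of residuals is zero, eq. (11)

sigma = K.T @ r / L                        # mixed moments
pearson = np.array([np.corrcoef(K[:, j], r)[0, 1] for j in range(P)])

# sigma_j = pearson_j * ||r - mean(r) 1||_2 / sqrt(L), eq. (16)
print(np.allclose(sigma, pearson * np.linalg.norm(r) / np.sqrt(L)))  # True
```

Since the factor is the same for every j, thresholding the mixed moments is equivalent to thresholding the correlation coefficients, which is what the regularization schemes below exploit.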
2.3.2 The Kernel Trick

Following the standard "support vector" way to derive learning rules for perceptrons, we have so far considered linear classifiers only. Most practical classification problems, however, require non-linear classification boundaries, which makes the construction of a proper feature space necessary. In analogy to the standard SVM, we now invoke the kernel trick. Let x^i and z^j be feature vectors which describe the "column" and the "row" objects of the dataset. We then choose a kernel function k(x^i, z^j) and compute the matrix K of relations between "column" and "row" objects:

    K_ij = k(x^i, z^j) = ⟨φ(x^i), ω(z^j)⟩ = ⟨x_φ^i, z_ω^j⟩ ,   (19)
where x_φ^i = φ(x^i) and z_ω^j = ω(z^j). In Appendix A it is shown that any L₂-kernel corresponds (for almost all points) to a dot product in a Hilbert space in the sense of eq. (19) and induces a mapping into a feature space within
which a linear classifier can then be constructed. In the following chapters we will, therefore, distinguish between the actual measurements x^i and z^j and the feature vectors x_φ^i and z_ω^j "induced" by the kernel k. Potential choices for "row" objects and their vectorial description are: z^j = x^j, P = L (standard construction of a Gram matrix); z^j = e^j, P = N ("elementary" features); z^j is the jth cluster center of a clustering algorithm applied to the vectors x^i (an example of a "complex" feature); or z^j is the jth vector of a PCA or ICA preprocessing (another example of a "complex" feature). If the entries K_ij of the data matrix are directly measured, the application of the kernel trick needs additional considerations. In Appendix A we show that, if the measurement process can be expressed through a kernel k(x^i, z^j) which takes a column object x^i and a row object z^j and outputs a number, the matrix K of relations between the "row" and "column" objects can be interpreted as a dot product

    K_ij = ⟨x_φ^i, z_ω^j⟩   (20)

in some feature space, where x_φ^i = φ(x^i) and z_ω^j = ω(z^j). Note that we distinguish between an object x^i and its associated feature vectors x^i or x_φ^i, leading to differences in the definition of k for the cases of vectorial and (measured) dyadic data. Eq. (20) justifies the P-SVM approach, which was derived for the case of linear predictors, also for measured data.
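The choices of "row" objects listed above can be made concrete in a few lines. All data below are toy values, and the "cluster centers" are simply the means of two arbitrary halves of the data, standing in for the output of a real clustering algorithm:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(2, 10))      # columns: vectors x^i (N = 2, L = 10)
rbf = lambda a, b: np.exp(-0.5 * np.sum((a - b) ** 2))

def relation_matrix(X, Z, k):
    """K[i, j] = k(x^i, z^j) for column objects X and row objects Z."""
    return np.array([[k(x, z) for z in Z.T] for x in X.T])

# 1) z^j = x^j: the standard L x L Gram matrix (pairwise data)
K_gram = relation_matrix(X, X, rbf)

# 2) z^j = e^j: "elementary" features; with a linear kernel K equals X^T
K_elem = relation_matrix(X, np.eye(2), lambda a, b: a @ b)

# 3) z^j = cluster centers ("complex" features); here just means of two
#    arbitrary halves of the data, a stand-in for a clustering algorithm
Z_centers = np.stack([X[:, :5].mean(axis=1), X[:, 5:].mean(axis=1)], axis=1)
K_clust = relation_matrix(X, Z_centers, rbf)

print(K_gram.shape, K_elem.shape, K_clust.shape)  # (10, 10) (10, 2) (10, 2)
```

All three matrices are legitimate inputs to the P-SVM; only the first is square, and none of them needs to be positive semidefinite.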
2.3.3 The P-SVM for Classification

If the P-SVM is used for classification, we suggest a regularization scheme based on slack variables ξ⁺ and ξ⁻. Slack variables allow for small violations of individual constraints if the correct choice of w would otherwise lead to a large increase of the objective function. We obtain
    min_{w,ξ⁺,ξ⁻}  (1/2) ||X_φ^⊤ w||² + C 1^⊤ (ξ⁺ + ξ⁻)     (21)
    s.t.  K^⊤ (X_φ^⊤ w − y) + ξ⁺ ≥ 0
          K^⊤ (X_φ^⊤ w − y) − ξ⁻ ≤ 0
          0 ≤ ξ⁺, ξ⁻

for the primal problem. The above regularization scheme makes the optimization problem robust against "outliers". A large value of a slack variable indicates that the particular "row" object only weakly influences the direction of the classification boundary, because it would otherwise considerably increase the value of the complexity term. This happens in particular for high levels of measurement noise, which lead to large, spurious values of the mixed moments σ_j. If the noise is large, the value of C must be small to "remove" the corresponding constraints via the slack variables ξ. If the strength of the measurement noise is known, the correct value of C can be determined a priori. Otherwise, it must be determined using model selection techniques.
To derive the dual optimization problem, we evaluate the Lagrangian L,

    L = (1/2) w^⊤ X_φ X_φ^⊤ w + C 1^⊤ (ξ⁺ + ξ⁻)             (22)
        − (α⁺)^⊤ (K^⊤ (X_φ^⊤ w − y) + ξ⁺)
        + (α⁻)^⊤ (K^⊤ (X_φ^⊤ w − y) − ξ⁻)
        − (μ⁺)^⊤ ξ⁺ − (μ⁻)^⊤ ξ⁻ ,                           (23)

where the vectors α⁺ ≥ 0, α⁻ ≥ 0, μ⁺ ≥ 0, and μ⁻ ≥ 0 are the Lagrange multipliers for the constraints in eqs. (21). The optimality conditions require that
    ∇_w L = X_φ X_φ^⊤ w − X_φ K α                           (24)
          = X_φ X_φ^⊤ w − X_φ X_φ^⊤ Z_ω α = 0 ,

where α = α⁺ − α⁻ (α_i = α_i⁺ − α_i⁻). Eq. (24) is fulfilled if

    w = Z_ω α .   (25)
In contrast to the standard SVM expansion of w into its support vectors x_φ, the weight vector w is now expanded into a set of complex features z_ω, which we will call "support features" in the following. We then arrive at the dual optimization problem:
P-SVM optimization problem for classification

    min_α  (1/2) α^⊤ K^⊤ K α − y^⊤ K α     (26)
    s.t.   −C 1 ≤ α ≤ C 1 ,

which now depends on the data via the kernel or data matrix K only. The dual problem can be solved by a Sequential Minimal Optimization (SMO) technique, which is described in (Hochreiter and Obermayer, 2004a). The SMO technique is essential if many complex features are used, because the P × P matrix K^⊤ K enters the dual formulation. Finally, the classification function f has to be constructed using the optimal values of the Lagrange parameters α:
P-SVM classification function

    f(x) = Σ_{j=1}^{P} α_j K(x)_j + b ,   (27)

where the expansion eq. (25) has been used for the weight vector w and b is given by eq. (18).
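Before turning to the interpretation of the coefficients, here is a minimal numerical sketch of the whole pipeline on toy data. The projected-gradient loop is only an illustrative stand-in for the SMO solver of Hochreiter and Obermayer (2004a), and the data, kernel width, and C are invented:

```python
import numpy as np

def psvm_dual(K, y, C, n_steps=5000):
    """Minimize 0.5 a^T K^T K a - y^T K a  s.t. -C <= a <= C  (eq. (26))
    by projected gradient descent, an illustrative stand-in for SMO."""
    H, g = K.T @ K, K.T @ y
    eta = 1.0 / np.linalg.norm(H, 2)   # step size below 2 / lambda_max
    a = np.zeros(K.shape[1])
    for _ in range(n_steps):
        a = np.clip(a - eta * (H @ a - g), -C, C)  # gradient step + box projection
    return a

rng = np.random.default_rng(4)
L = 40
x = np.vstack([rng.normal(-1.0, 0.5, size=(L // 2, 2)),    # class -1
               rng.normal(+1.0, 0.5, size=(L // 2, 2))])   # class +1
y = np.array([-1.0] * (L // 2) + [1.0] * (L // 2))

# Pairwise data: RBF Gram matrix (z^j = x^j), column vectors normalized
d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * d2)
K = (K - K.mean(0)) / K.std(0)

alpha = psvm_dual(K, y, C=10.0)
b = y.mean()                       # eq. (18)
pred = np.sign(K @ alpha + b)      # eq. (27) evaluated on the training set
print("training accuracy:", (pred == y).mean())
```

Projected gradient converges far more slowly than SMO on ill-conditioned K^⊤K; it is used here only because it fits in a few lines.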
The classifier based on eq. (27) depends on the coefficients α_j, which were determined during optimization, on b, which can be computed directly, and on the measured values K(x)_j for the new object x. The coefficients α_j = α_j⁺ − α_j⁻ can be interpreted as class indicators, because they separate the complex features into features which are relevant for class 1 and class −1, according to the sign of α_j. Note that if we consider the Lagrange parameters α_j as parameters of the classifier, we find that
    dR_emp[f_{w + t z_ω^j, b}]/dt = σ_j = ∂R_emp[f]/∂α_j .
The directional derivatives of the empirical error R_emp along the complex features in the primal formulation correspond to its partial derivatives with respect to the corresponding Lagrange parameters in the dual formulation. One of the most crucial properties of the P-SVM procedure is that the dual optimization problem depends on K only via K^⊤ K. Therefore, K is neither required to be positive semidefinite nor to be square. This allows not only the construction of SVM-based classifiers for matrices K of general shape but also the extension of SVM-based approaches to the new class of indefinite kernels operating on the objects' feature vectors. In the following we illustrate the application of the P-SVM approach to classification using a toy example. The data set consists of 70 objects x, 28 from class 1 and 42 from class 2, which are described by 2D feature vectors x (open and solid circles in Fig. 4). Two pairwise data sets were then generated by applying a standard RBF kernel and the (indefinite) sine kernel k(x^i, x^j) = sin(θ ||x^i − x^j||), which leads to an indefinite Gram matrix. Fig. 4 shows the classification result. The sine kernel is more appropriate than the RBF kernel for this data set because it is better adjusted to the "oscillatory" regions of class membership, leading to a smaller number of support vectors and to a smaller test error. In general, the value of θ has to be selected using standard model selection techniques. A large value of θ leads to a more "complex" set of classifiers, reduces the classification error on the training set, but increases the error on the test set. Non-Mercer kernels extend the range of kernels which are currently used and, therefore, open up a new direction of research for kernel design.
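The key point, that the dual (26) remains convex even for an indefinite kernel because only K^⊤K enters, can be illustrated directly with the sine kernel from the toy example (the 1-D sample points and θ = 25 below are chosen arbitrarily):

```python
import numpy as np

x = np.linspace(0.0, 1.2, 15)                        # toy 1-D "column" objects
theta = 25.0
K = np.sin(theta * np.abs(x[:, None] - x[None, :]))  # indefinite sine kernel

eig_K = np.linalg.eigvalsh(K)
eig_KtK = np.linalg.eigvalsh(K.T @ K)

print(eig_K.min() < 0)          # True: K itself is not positive semidefinite
print(eig_KtK.min() > -1e-8)    # True: K^T K is, so the dual stays convex
```

Since K has a zero diagonal (sin(0) = 0) and is not the zero matrix, its trace is zero and it must have negative eigenvalues, while K^⊤K is positive semidefinite by construction.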
2.3.4 The P-SVM for Regression

The new objective function of eq. (4) was motivated for a classification problem, but it can also be used to find an optimal regression function in a regression task. In regression the term ||X_φ^⊤ w||² = ||X_φ^⊤ ŵ||² ||w||², ŵ := w/||w||, is the product of a term which expresses the deviation of the data from the zero-isosurface of the regression function and a term which corresponds to the smoothness of the regressor. The smoothness of the regression function is determined by the norm of the weight vector w. If f(x_φ^i) = ⟨w, x_φ^i⟩ + b and the

[Figure 4 panels: left - kernel: RBF, SV: 50, σ: 0.1, C: 100; right - kernel: sine, SV: 7, θ: 25, C: 100]
Figure 4: Application of the P-SVM method to a toy classification problem. Objects are described by 2D feature vectors x. 70 feature vectors were generated, 28 for class 1 (open circles) and 42 for class 2 (solid circles). A Gram matrix was constructed using the kernels "RBF" k(x^i, x^j) = exp(−(1/(2σ²)) ||x^i − x^j||²) (left) and "sine" k(x^i, x^j) = sin(θ ||x^i − x^j||) (right). White and gray indicate regions of class 1 and class 2 membership. Circled data indicate support vectors ("SV"). For parameters see figure.
length of the vectors x_φ is bounded by B, then the deviation of f from the offset b is bounded by

    |f(x_φ^i) − b| = |⟨w, x_φ^i⟩| ≤ ||w|| ||x_φ^i|| ≤ ||w|| B ,   (28)

where the first "≤" follows from the Cauchy-Schwarz inequality; hence a smaller value of ||w|| leads to a smoother regression function. This tradeoff between smoothness and residual error is also reflected by eq. (59) in Appendix A, which shows that eq. (4) is the L₂-norm of the function f. The constraints of vanishing mixed moments carry over to regression problems with the only modification that the target values y_i in eqs. (21) are real rather than binary numbers. The constraints are even more "natural" for regression because the r_i are indeed the residuals a regression function should minimize. We, therefore, propose to use the primal optimization problem, eqs. (21), and its corresponding dual, eqs. (26), also for the regression setting. Fig. 5 shows the application of the P-SVM to a toy regression example (pairwise data). 50 data points are randomly chosen from the true function (dashed line) and i.i.d. Gaussian noise with mean 0 and standard deviation 0.2 is added to each y-component. One outlier was added at x = 0. The figure shows the P-SVM regression results (solid lines) for an RBF kernel and three different combinations of C and σ. The hyperparameter C controls the residual error: a smaller value of C increases the residual error at x = 0 but also the number of support vectors. The width σ of the RBF kernel controls the overall smoothness of the regressor: a larger value of σ increases the error at x = 0 without increasing the number of support vectors (arrows in bold in Fig. 5).
[Figure 5: three panels; top left: SV: 7, σ: 2, C: 20; top right: SV: 5, σ: 4, C: 20; bottom: SV: 14, σ: 2, C: 0.6; panel labels mark x = −2.]
Figure 5: Application of the P-SVM method to a toy regression problem. Objects (small dots), described by the x-coordinate, were generated by randomly choosing a value from [−10, 10] and calculating the y-value from the true function (dashed line) plus Gaussian noise N(0, 0.2). One outlier was added at x = 0 (thin arrows). A Gram matrix was generated using an RBF kernel, k(x^i, x^j) = exp(−‖x^i − x^j‖² / (2σ²)). The solid lines show the regression result. Circled dots indicate support vectors ("SV"). For parameters see figure. Bold arrows mark x = −2.
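As an illustration, the toy setup of Fig. 5 can be sketched as follows. The true function is not specified in the text, so `target` below is a hypothetical stand-in; only the sampling interval, noise level, and RBF kernel follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

# 50 x-values drawn uniformly from [-10, 10], as in Fig. 5
L = 50
x = np.sort(rng.uniform(-10.0, 10.0, size=L))

def target(x):
    # hypothetical stand-in for the (unspecified) true function
    return np.sinc(x / np.pi)

# y-values: true function plus Gaussian noise N(0, 0.2)
y = target(x) + rng.normal(0.0, 0.2, size=L)

def rbf_gram(a, b, sigma):
    """Gram matrix K_ij = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

K = rbf_gram(x, x, sigma=2.0)  # sigma as in the upper-left panel
```

A larger σ makes the rows of K flatter, which is the smoothing effect discussed above: the regressor averages over a wider neighborhood.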
2.3.5 The P-SVM for Feature Selection

In this section we modify the P-SVM method for feature selection such that it can serve as a data preprocessing method in order to improve the generalization performance of subsequent classification or regression tasks (see also Hochreiter and Obermayer, 2004b). Because the P-SVM expands w into a sparse set of support features, it can be modified to optimally extract a small number of "informative" features from the set of "row" objects. The set of "support features" can then be used as input to an arbitrary predictor, e.g. another P-SVM, a standard SVM, or a K-nearest-neighbor classifier. Noisy measurements can lead to spurious mixed moments, i.e. complex features may contain no information about the objects' attributes but still exhibit finite values of α_j. In order to prevent those features from affecting the classification
boundary or the regression function, we introduce a "correlation threshold" ε and modify the constraints in eqs. (17) according to
  K^T (X^T w − y) − ε 1 ≤ 0 ,   (29)
  K^T (X^T w − y) + ε 1 ≥ 0 .

This regularization scheme is analogous to the ε-insensitive loss (Schölkopf and Smola, 2002). Absolute values of mixed moments smaller than ε are considered to be spurious; the corresponding features do not influence the weight vector, because the constraints remain fulfilled. The value of ε directly correlates with the strength of the measurement noise and can be determined a priori if it is known. If this is not the case, ε serves as a hyperparameter and its value can be determined using model selection techniques. Note that data vectors have to be normalized (see Subsection 2.3.1) before applying the P-SVM, because otherwise a global value of ε would not suffice. Combining eq. (4) and eqs. (29) we then obtain the primal optimization problem

  min_w  (1/2) ‖X^T w‖²   (30)
  s.t.  K^T (X^T w − y) + ε 1 ≥ 0
        K^T (X^T w − y) − ε 1 ≤ 0

for P-SVM feature selection. In order to derive the dual formulation we have to evaluate the Lagrangian:
  L = (1/2) w^T X X^T w − α_+^T ( K^T (X^T w − y) + ε 1 )   (31)
      + α_−^T ( K^T (X^T w − y) − ε 1 ) ,

where we have used the notation from Section 2.3.3. The vector w is again expressed through the complex features,
  w = Z ω ,   (32)

and we obtain the dual formulation of eq. (30):
P-SVM optimization problem for feature selection
  min_{α_+, α_−}  (1/2) (α_+ − α_−)^T K^T K (α_+ − α_−) − y^T K (α_+ − α_−) + ε 1^T (α_+ + α_−)   (33)
  s.t.  0 ≤ α_+ ,  0 ≤ α_− .
The term ε 1^T (α_+ + α_−) in this dual objective function enforces a sparse expansion of the weight vector w in terms of the support features. This occurs,
because for large enough values of ε, this term forces all α_j towards zero except for the complex features which are most relevant for classification or regression. If K^T K is singular and w is not uniquely determined, ε enforces a unique solution, which is characterized by the most sparse representation through complex features. The dual problem is again solved by a fast Sequential Minimal Optimization (SMO) technique (see Hochreiter and Obermayer, 2004a). Finally, let us address the relationship between the value of a Lagrange multiplier α_j and the "importance" of the corresponding complex feature z_ω^j for prediction. The change of the empirical error under a change of the weight vector by an amount α_j along the direction of a complex feature z_ω^j is given by
  R_emp[f_{w + α_j z_ω^j, b}] − R_emp[f_{w, b}]   (34)
    = (α_j / L) Σ_i K_ij r^i + (α_j² / (2L)) Σ_i K_ij²  ≤  |α_j| ε / L + α_j² / 2 ,

because the constraints eq. (29) ensure that |Σ_i K_ij r^i| ≤ ε, and the normalization of Subsection 2.3.1 implies Σ_i K_ij² = L. If a complex feature z_ω^j is completely removed, then α_j = −ω_j and

  R_emp[f_{w − ω_j z_ω^j, b}] − R_emp[f_{w, b}]  ≤  |ω_j| ε / L + ω_j² / 2 .   (35)

The Lagrange parameter α_j is thus directly related to the increase in the empirical error when a feature is removed. Therefore, α serves as an importance measure for the complex features and allows features to be ranked according to the absolute values of its components. In the following, we illustrate the application of the P-SVM approach to feature selection using a classification toy example. The data set consists of 50 "column" objects, 25 from each class, which are described by 2D-feature vectors x (open and solid circles in Fig. 6). 50 "row" objects were randomly selected by choosing their 2D-feature vectors z according to a uniform distribution on [−1.2, 1.2] × [−1.2, 1.2]. K was generated using an RBF kernel, k(x^i, z^j) = exp(−‖x^i − z^j‖² / (2σ²)), with σ = 0.2. Fig. 6 shows the result of the P-SVM feature selection method with a correlation threshold ε = 20. Only six features were selected (crosses in Fig. 6), but every group of data points ("column" objects) is described (and detected) by one or two feature vectors ("row" objects). The number of selected features depends on ε and on σ, which determines how the "strength" of a complex feature decreases with the distance ‖x^i − z^j‖. Smaller ε or larger σ would result in a larger number of support features.
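The dual (33) is a convex quadratic program over the positive orthant, so even a plain projected-gradient loop can solve it. The following sketch (not the SMO solver used in the paper) illustrates how ε produces a sparse ω = α_+ − α_− and how |ω_j| can be used to rank features; the toy data and the value of ε are made up for illustration.

```python
import numpy as np

def psvm_feature_selection(K, y, eps, iters=20000):
    """Projected gradient descent on the dual (33):
    min 1/2 (a+ - a-)' K'K (a+ - a-) - y'K (a+ - a-) + eps 1'(a+ + a-)
    s.t. a+ >= 0, a- >= 0.  Returns omega = a+ - a-."""
    KtK = K.T @ K
    Kty = K.T @ y
    lr = 0.5 / np.linalg.norm(KtK, 2)  # conservative step size
    ap = np.zeros(K.shape[1])
    am = np.zeros(K.shape[1])
    for _ in range(iters):
        g = KtK @ (ap - am) - Kty       # = K'(X'w - y) at w = Z(a+ - a-)
        ap = np.maximum(0.0, ap - lr * (g + eps))
        am = np.maximum(0.0, am - lr * (-g + eps))
    return ap - am

# toy data: labels depend (noisily) on the first of 10 candidate features
rng = np.random.default_rng(0)
K = rng.standard_normal((100, 10))
y = np.sign(K[:, 0] + 0.1 * rng.standard_normal(100))
omega = psvm_feature_selection(K, y, eps=40.0)
ranking = np.argsort(-np.abs(omega))  # support features first
```

At the optimum the iterate satisfies the ε-insensitive constraints (29), i.e. |K^T(Kω − y)| ≤ ε componentwise, and most components of ω are exactly zero; increasing `eps` shrinks the support-feature set further.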
2.4 The Dot Product Interpretation of Dyadic Data

The derivation of the P-SVM method is based on the assumption that the matrix K is a dot product matrix whose elements denote a scalar product between the feature vectors which describe the "row" and the "column" objects. If K, however, is a matrix of measured values, the question arises under which conditions such a matrix can be interpreted as a dot product matrix. In Appendix A we show that the above question reduces to
Figure 6: P-SVM method applied to a toy feature selection problem. "Column" and "row" objects are described by 2D-feature vectors x and z. 50 "column" feature vectors, 25 from each class (open / solid circles), were chosen from N((±1, ±1), 0.1 I), where the centers are chosen with equal probability. The "row" feature vectors were chosen uniformly from [−1.2, 1.2]². K is constructed by an RBF kernel, exp(−‖x^i − z^j‖² / (2 · 0.2²)). Black crosses indicate P-SVM support features (SV: 6, σ: 0.2, ε: 20).
(1) Can the set X of "column" objects x be completed to a measure space?
(2) Can the set Z of "row" objects z be completed to a measure space?
(3) Can the measurement process be expressed via the evaluation of a measurable kernel k(x, z) which is from L²(X, Z)?

(1) and (2) can easily be fulfilled by defining a suitable σ-algebra on the sets; (3) holds if k is bounded and the sets X and Z are compact. Condition (3) equates the evaluation of a kernel as known from standard SVMs with physical measurements, and the physical characteristics of the measurement device determine the properties of the kernel, e.g. boundedness and continuity. But can a measurement process indeed be expressed through a kernel? There is no full answer to this question from a theoretical viewpoint; practical applications have to confirm (or disprove) the chosen ansatz and data model.
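One practical consequence is worth making concrete: whatever the measurement process is, the matrix K^T K appearing in the P-SVM dual is a Gram matrix and therefore always positive semidefinite, even when K itself is rectangular or indefinite, so the dual quadratic program is always well defined. A minimal numerical check (the matrix values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# a hypothetical "measurement" matrix: 6 column objects, 4 row objects;
# it is neither square nor symmetric, and no kernel was used to build it
K = rng.standard_normal((6, 4))

Q = K.T @ K  # the quadratic term of the P-SVM dual
eigenvalues = np.linalg.eigvalsh(Q)

# Q = K'K is a Gram matrix, hence symmetric positive semidefinite
assert np.allclose(Q, Q.T)
assert np.all(eigenvalues >= -1e-10)
```

This is the formal reason why the P-SVM, unlike standard SVMs, does not require the measured matrix itself to be positive definite or even square.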
3 Numerical Experiments and Applications
In this section we apply the P-SVM method to various kinds of real-world data sets and provide benchmark results with previously proposed methods when appropriate. This section consists of three parts, which cover classification, regression, and feature selection. In part one the P-SVM is first tested as a classifier on data sets from the UCI Benchmark Repository, and its performance is compared with results obtained with C- and ν-SVMs for different kernels. Secondly, we apply the P-SVM to a measured (rather than constructed) pairwise data set ("protein") and a genuine dyadic data set ("World Wide Web"). In part two the P-SVM is applied to regression problems taken from the UCI benchmark repository and compared with results obtained with C-Support Vector Regression and Bayesian SVMs. Part three describes results obtained for the P-SVM as a feature selection method for microarray data and for the "protein" and "World Wide Web" data sets from part one.
3.1 Application to Classification Problems

3.1.1 UCI Data Sets

In this section we report benchmark results for the data sets "thyroid" (5 features), "heart" (13 features), "breast-cancer" (9 features), and "german" (20 features) from the UCI benchmark repository, and for the data set "banana" (2 features) taken from (Rätsch et al., 2001). All data sets were preprocessed as described in (Rätsch et al., 2001) and divided into 100 training/test set pairs. Data sets were generated through resampling, where data points were randomly selected for the training set and the remaining data was used for the test set. We downloaded the original 100 training/test set pairs from http://ida.first.fraunhofer.de/projects/bench/. For the data sets "german" and "banana" we restricted the training set to the first 200 examples of the original training set in order to keep hyperparameter selection feasible, but used the original test sets. Pairwise data sets were generated by constructing the Gram matrix for radial basis function (RBF), polynomial (POL), and Plummer (PLU, see Hochreiter et al., 2003) kernels, and the Gram matrices were used as input for kernel Fisher discriminant analysis (KFD, Mika et al., 1999), C-, ν-, and P-SVM. Because KFD only selects a direction in input space onto which all data points are projected, we must select a classifier for the resulting one-dimensional classification task. We follow (Schölkopf and Smola, 2002) and classify a data point according to its posterior probability under the assumption of a Gaussian distribution for each label. Hyperparameters (C, ν, and kernel parameters) were optimized using 5-fold cross validation on the corresponding training sets. To ensure a fair comparison, the hyperparameter selection procedure was equal for all methods, except that the values of ν for the ν-SVM were selected from {0.1, . . . , 0.9}, in contrast to the selection of C, for which a logarithmic scale was used.
To test the significance of the differences in generalization performance (percent of misclassifications), we calculated for what percentage of the individual runs (for a total of 100, see below) our method was significantly better or worse than the others. In order to do so, we first performed a test for the "difference of two proportions" for each training/test set pair (Dietterich, 1998). The "difference of two proportions" is the difference of the misclassification rates of two models on the test set, where the models are selected on the training set by the two methods which are to be compared. After this test we adjusted the false discovery rate through the "Benjamini-Hochberg procedure" (Benjamini and Hochberg, 1995), which was recently shown to be correct for dependent outcomes of the tests (Benjamini and Yekutieli, 2001). The fact that the tests can be dependent is important, because training and test sets overlap for the different training/test set pairs. The false discovery rate was controlled at 5 %. We counted, for each pair of methods, the selected models from the first method which perform significantly (5 % level) better than the selected models from the second method.
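The Benjamini-Hochberg step-up procedure itself is simple to state: sort the m p-values, find the largest k with p_(k) ≤ (k/m)q, and reject the k hypotheses with the smallest p-values. A sketch follows; the p-values are the worked example from Benjamini and Hochberg (1995), not values from the experiments reported here.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Step-up FDR control: reject the k smallest p-values, where k is
    the largest index with p_(k) <= (k/m) * q."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest index passing the test
        reject[order[: k + 1]] = True
    return reject

# p-values from the worked example in Benjamini and Hochberg (1995)
pvals = [0.0001, 0.0004, 0.0019, 0.0095, 0.0201, 0.0278, 0.0298,
         0.0344, 0.0459, 0.3240, 0.4262, 0.5719, 0.6528, 0.7590, 1.0000]
reject = benjamini_hochberg(pvals, q=0.05)  # rejects the 4 smallest
```

Note that the procedure rejects all hypotheses up to the largest passing index, including intermediate ones whose p-values exceed their individual thresholds.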
Table 1 summarizes the percentage of misclassification averaged over 100
                    RBF            POL            PLU
Thyroid
  C-SVM          5.11 (2.34)    4.51 (2.02)    4.96 (2.35)
  ν-SVM          5.15 (2.23)    4.04 (2.12)    4.83 (2.03)
  KFD            4.96 (2.24)    6.52 (3.18)    5.00 (2.26)
  P-SVM          4.71 (2.06)    9.44 (3.12)    5.08 (2.18)
Heart
  C-SVM         16.67 (3.51)   18.26 (3.50)   16.33 (3.47)
  ν-SVM         16.87 (3.87)   17.44 (3.90)   18.47 (7.81)
  KFD           17.82 (3.45)   22.53 (3.37)   17.80 (3.86)
  P-SVM         16.18 (3.66)   16.67 (3.40)   16.54 (3.64)
Breast-Cancer
  C-SVM         26.94 (5.18)   26.87 (4.79)   26.48 (4.87)
  ν-SVM         27.69 (5.62)   26.69 (4.73)   30.16 (7.83)
  KFD           27.53 (4.19)   31.30 (6.11)   27.19 (4.72)
  P-SVM         26.78 (4.58)   26.40 (4.54)   26.66 (4.59)
Banana
  C-SVM         11.88 (1.11)   12.09 (0.96)   11.81 (1.07)
  ν-SVM         11.67 (0.90)   12.72 (2.16)   11.74 (0.98)
  KFD           12.32 (1.12)   14.04 (3.89)   12.14 (0.96)
  P-SVM         11.59 (0.96)   14.93 (2.09)   11.52 (0.93)
German
  C-SVM         26.51 (2.60)   27.27 (3.23)   26.88 (3.12)
  ν-SVM         27.14 (2.84)   27.13 (3.06)   28.60 (6.27)
  KFD           26.58 (2.95)   30.96 (3.23)   26.90 (3.15)
  P-SVM         26.45 (3.20)   25.87 (2.45)   26.65 (2.95)
Table 1: Average percentage of misclassification for the UCI and the "banana" data sets. The table compares results obtained with kernel Fisher discriminant analysis (KFD), C-, ν-, and P-SVM for the radial basis function (RBF), exp(−‖x^i − x^j‖² / (2σ²)), polynomial (POL), (⟨x^i, x^j⟩ + β)^δ, and Plummer (PLU), (‖x^i − x^j‖ + β)^(−δ), kernels. Results were averaged over 100 experiments with separate training and test sets. For each data set, numbers in bold and italic highlight the best and the second best result; the numbers in brackets denote standard deviations of the trials. C, ν, and kernel parameters were determined using 5-fold cross validation on the training set and usually differed between individual experiments.
experiments. Despite the fact that C- and ν-SVMs are equivalent, results differ because of the somewhat different model selection results for the hyperparameters C and ν. Best and second best results are indicated by bold and italic numbers. The significance tests did not reveal significant differences over the 100 runs in generalization performance for the majority of the cases (for
details see http://ni.cs.tu-berlin.de/publications/psvm_sup). The P-SVM with "RBF" and "PLU" kernels, however, never performed significantly worse in any of the 100 runs of each data set than its competitors. Note that the significance tests do not tell whether the averaged misclassification rates of one method are significantly better/worse than the rates of another method. They provide information on how many runs out of 100 one method was significantly better than another method. The UCI benchmark results show that the P-SVM is competitive with other state-of-the-art methods for a standard problem setting (measurement matrix equivalent to the Gram matrix). Although the P-SVM method never performed significantly worse, it generally required fewer support vectors than other SVM approaches. This was also true for the cases where the P-SVM's performance was significantly better.
3.1.2 Protein Data Set

The "protein" data set (cf. Hofmann and Buhmann, 1997) was provided by M. Vingron and consists of 226 proteins from the globin families. Pairs of proteins are characterized by their evolutionary distance, which is defined as the probability of transforming one amino acid sequence into the other via point mutations. Class labels are provided, which denote membership in one of the four families: hemoglobin-α ("H-α"), hemoglobin-β ("H-β"), myoglobin ("M"), and heterogeneous globins ("GH"). Table 2 summarizes the classification results, which were obtained with the generalized SVM (G-SVM, Graepel et al., 1999; Mangasarian, 1998) and the P-SVM method. Since the G-SVM interprets the columns of the data matrix as feature vectors for the column objects and applies a standard ν-SVM to these vectors (this is also called a "feature map", Schölkopf and Smola, 2002), we call the G-SVM simply "ν-SVM" in the following. The table shows the percentage of misclassification for the four two-class classification problems "one class against the rest". The P-SVM yields better classification results, although a conservative test for significance was not possible due to the small number of data. However, the P-SVM selected 180 proteins as support vectors on average, compared to 203 proteins used by the ν-SVM (note that for 10-fold cross validation 203 is the average training set size). Here a smaller number of support vectors is highly desirable, because it reduces the computational costs of the sequence alignments which are necessary for the classification of new examples.
3.1.3 World Wide Web Data Set

The "World Wide Web" data sets consist of 8,282 WWW pages collected during the WebKB project at Carnegie Mellon University in January 1997 from the web sites of the computer science departments of the four institutions Cornell, Texas, Washington, and Wisconsin. The pages were manually classified into the categories "student", "faculty", "staff", "department", "course", "project", and "other".
protein data
           Reg.    H-α    H-β    M      GH
Size       —       72     72     39     30
ν-SVM      0.05    1.3    4.0    0.5    0.5
ν-SVM      0.1     1.8    4.5    0.5    0.9
ν-SVM      0.2     2.2    8.9    0.5    0.9
P-SVM      300     0.4    3.5    0.0    0.4
P-SVM      400     0.4    3.1    0.0    0.9
P-SVM      500     0.4    3.5    0.0    1.3
Table 2: Percentage of misclassification for the "protein" data set for classifiers obtained with the P- and ν-SVM methods. Column "Reg." lists the values of the regularization parameter (ν for the ν-SVM and C for the P-SVM). Columns "H-α" to "GH" provide the percentage of misclassification for the four problems "one class against the rest" ("Size" denotes the number of proteins per class), computed using 10-fold cross validation. The best results for each problem are shown in bold. The data matrix was not positive semi-definite.
Every pair (i, j) of pages is characterized by whether page i contains a hyperlink to page j and vice versa. The data is summarized using two binary matrices and a ternary matrix. The first matrix K ("out") contains a one if at least one outgoing link (i → j) exists and a zero if no outgoing link exists; the second matrix K^T ("in") contains a one if at least one ingoing link (j → i) exists and a zero otherwise; and the third, ternary matrix (1/2)(K + K^T) ("sym") contains a zero if no link exists, a value of 0.5 if only one unidirectional link exists, and a value of 1 if links exist in both directions. In the following, we restricted the data set to pages from the first six classes which had more than one in- or outgoing link. The data set thus consists of the four subsets "Cornell" (350 pages), "Texas" (286 pages), "Wisconsin" (300 pages), and "Washington" (433 pages). Table 3 summarizes the classification results for the C- and P-SVM methods. The parameter C for both SVMs was optimized for each cross validation trial using another 4-fold cross validation on the training set. Significance tests were performed to evaluate the differences in generalization performance using the 10-fold cross-validated paired t-test (Dietterich, 1998). We considered 48 tasks (4 universities, 4 classes, 3 matrices) and checked for each task whether the C-SVM or the P-SVM performed better using a p-value of 0.05. In 30 tasks the P-SVM had a significantly better performance than the C-SVM, while the C-SVM was never significantly better than the P-SVM (for details see http://ni.cs.tu-berlin.de/publications/psvm_sup). Classification results are better for the asymmetric matrices "in" and "out" than for the symmetric matrix "sym", because there are cases for which highly indicative pages (hubs) exist which are connected to one particular class of pages by either in- or outgoing links. The symmetric case blurs the contribution of the indicative pages, because ingoing and outgoing links can no longer be distinguished, which leads to poorer performance. Because the P-SVM yields fewer support vectors, online classification is faster than for the C-SVM and, if web pages cease to exist, the P-SVM is less likely to be affected.
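The construction of the three connectivity matrices from a raw link list can be sketched as follows; the page indices and links are made up for illustration.

```python
import numpy as np

def link_matrices(links, n):
    """Build the "out", "in", and "sym" matrices from (i, j) link pairs,
    where (i, j) means page i contains a hyperlink to page j."""
    K_out = np.zeros((n, n))
    for i, j in links:
        K_out[i, j] = 1.0            # binary "out" matrix
    K_in = K_out.T                   # binary "in" matrix
    K_sym = 0.5 * (K_out + K_out.T)  # ternary: 0, 0.5 (one way), 1 (both)
    return K_out, K_in, K_sym

# toy example: pages 0 and 1 link to each other; page 2 links to page 0
K_out, K_in, K_sym = link_matrices([(0, 1), (1, 0), (2, 0)], n=3)
```

Any of the three matrices can be fed to the P-SVM directly as K; note that "out" and "in" are generally asymmetric and "sym" is symmetric but not necessarily positive semidefinite.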
3.2 Application to Regression Problems

In this section we report results for the data sets "robot arm" (2 features), "boston housing" (13 features), "computer activity" (21 features), and "abalone" (10 features) from the UCI benchmark repository. The data preprocessing is described in (Chu et al., 2004), and the data sets are available as training/test set pairs at http://guppy.mpe.nus.edu.sg/~chuwei/data. The sizes of the data sets were (training set / test set): "robot arm": 200 / 200, 1 set; "boston housing": 481 / 25, 100 sets; "computer activity": 1000 / 6192, 10 sets; "abalone": 3000 / 1177, 10 sets. Pairwise data sets were generated by constructing the Gram matrices for radial basis function kernels of different widths σ, and the Gram matrices were used as input for the three regression methods C-support vector regression (SVR, Schölkopf and Smola, 2002), Bayesian support vector regression (BSVR, Chu et al., 2004), and the P-SVM. Hyperparameters (C and σ) were optimized using n-fold cross validation (n = 50 for "robot arm", n = 20 for "boston housing", n = 4 for "computer activity", and n = 4 for "abalone"). Parameters were first optimized on a coarse 4 × 4 grid and later refined on a 7 × 7 fine grid around the values for C and σ selected in the first step (65 tests per parameter selection). Table 4 shows the mean squared error and the standard deviation of the results. We also performed a Wilcoxon signed rank test to verify the significance of these results (for details see http://ni.cs.tu-berlin.de/publications/psvm_sup), except for the "robot arm" data set, which has only one training/test set pair, and the "boston housing" data set, which contains too few test examples. On "computer activity" SVR was significantly better (5 % threshold) than BSVR, and on both data sets "computer activity" and "abalone", SVR and BSVR were significantly outperformed by the P-SVM, which in addition used fewer support vectors than its competitors.
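The two-stage grid search described above (a 4 × 4 coarse grid followed by a 7 × 7 fine grid, 65 evaluations in total) can be sketched as follows. The grid values and the refinement span of one coarse step are assumptions, since the exact grids are not listed, and `cv_error` is a stand-in for a real cross-validation estimate.

```python
import numpy as np
from itertools import product

def coarse_to_fine(score, C_coarse, s_coarse):
    """Minimize score(C, sigma) on a coarse grid, then on a 7x7
    logarithmic fine grid centered on the best coarse point."""
    C0, s0 = min(product(C_coarse, s_coarse), key=lambda p: score(*p))
    C_fine = C0 * np.logspace(-0.5, 0.5, 7)  # within one coarse step
    s_fine = s0 * np.logspace(-0.5, 0.5, 7)
    return min(product(C_fine, s_fine), key=lambda p: score(*p))

# stand-in for a cross-validation error, minimal near C ~ 2, sigma ~ 0.63
def cv_error(C, sigma):
    return (np.log10(C) - 0.3) ** 2 + (np.log10(sigma) + 0.2) ** 2

grid = np.logspace(-2, 1, 4)  # 0.01, 0.1, 1, 10 (16 coarse evaluations)
C_best, s_best = coarse_to_fine(cv_error, grid, grid)
```

The 16 coarse plus 49 fine evaluations match the 65 tests per parameter selection mentioned above; the refinement buys roughly one extra decimal digit of resolution at fixed cost.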
3.3 Application to Feature Selection Problems

In this section we apply the P-SVM to feature selection problems of various kinds, using the "correlation threshold" regularization scheme (Section 2.3.5). We first reanalyze the "protein" and "world wide web" data sets of Sections 3.1.2 and 3.1.3 and then report results on three DNA microarray data sets. Further feature selection results can be found in (Hochreiter and Obermayer, 2005), which also reports results for the NIPS 2003 feature selection challenge, where the P-SVM was the best stand-alone method for selecting a compact feature set, and in (Hochreiter and Obermayer, 2004b), where details of the microarray data set benchmarks are reported.
                    Course         Faculty        Project        Student
Cornell University
  Size              57             60             52             143
  C-SVM (Sym)       11.1 (6.2)     19.7 (5.3)     13.7 (4.0)     50.0 (11.5)
  C-SVM (Out)       12.6 (3.1)     15.1 (6.0)     10.6 (4.9)     22.3 (10.3)
  C-SVM (In)        11.1 (4.9)     21.4 (4.3)     14.6 (5.5)     48.9 (15.5)
  P-SVM (Sym)       12.3 (3.3)     17.1 (6.2)     15.4 (6.3)     19.1 (5.7)
  P-SVM (Out)       8.6 (3.8)      14.3 (6.3)     8.3 (4.9)      16.9 (7.9)
  P-SVM (In)        7.1 (4.1)      13.7 (6.6)     10.9 (5.5)     17.1 (5.7)
Texas University
  Size              52             35             29             129
  C-SVM (Sym)       17.2 (9.0)     22.0 (9.1)     19.8 (6.7)     53.5 (11.8)
  C-SVM (Out)       9.5 (5.1)      16.5 (5.8)     20.2 (8.7)     28.9 (11.7)
  C-SVM (In)        12.6 (4.9)     20.6 (7.8)     20.9 (5.1)     16.4 (7.6)
  P-SVM (Sym)       15.8 (5.8)     13.6 (7.3)     12.2 (6.2)     25.5 (6.9)
  P-SVM (Out)       8.1 (7.6)      9.8 (3.6)      9.8 (3.9)      20.9 (6.7)
  P-SVM (In)        12.3 (5.6)     10.5 (6.3)     9.4 (4.6)      13.0 (5.0)
Wisconsin University
  Size              77             36             22             117
  C-SVM (Sym)       27.0 (10.0)    22.0 (5.5)     14.0 (6.4)     49.3 (11.1)
  C-SVM (Out)       19.3 (7.5)     16.0 (3.8)     10.3 (4.8)     34.3 (10.5)
  C-SVM (In)        22.0 (8.6)     16.3 (5.8)     7.7 (4.5)      24.3 (9.9)
  P-SVM (Sym)       18.7 (4.5)     15.0 (9.3)     10.0 (5.4)     34.3 (8.6)
  P-SVM (Out)       12.0 (5.5)     11.3 (6.5)     7.7 (4.2)      23.7 (4.8)
  P-SVM (In)        13.3 (4.4)     8.7 (8.2)      6.3 (7.1)      13.3 (5.9)
Washington University
  Size              169            44             39             151
  C-SVM (Sym)       19.6 (6.8)     18.7 (6.8)     10.6 (3.5)     43.6 (8.3)
  C-SVM (Out)       10.6 (4.6)     14.1 (3.0)     14.3 (4.8)     28.2 (9.8)
  C-SVM (In)        20.3 (6.4)     20.4 (5.3)     13.8 (4.7)     38.3 (11.9)
  P-SVM (Sym)       17.1 (4.4)     13.4 (6.6)     8.8 (2.1)      20.3 (6.8)
  P-SVM (Out)       10.6 (5.2)     12.7 (2.9)     6.7 (3.4)      17.1 (4.3)
  P-SVM (In)        11.8 (5.6)     9.2 (6.2)      6.7 (2.0)      14.3 (6.9)
Table 3: Percentage of misclassification for the World Wide Web data sets for classifiers obtained with the P- and C-SVM methods. The percentage of misclassification was measured using 10-fold cross validation. The best and second best results for each data set and classification task are indicated in bold and italics; numbers in brackets denote standard deviations of the trials.
3.3.1 Protein and World Wide Web Data Sets

In this section we again apply the P-SVM to the "protein" and "world wide web" data sets of Sections 3.1.2 and 3.1.3. Using both regularization schemes
          robot arm (×10⁻³)   boston housing   computer activity   abalone
SVR       5.84                10.27 (7.21)     13.80 (0.93)        0.441 (0.021)
BSVR      5.89                12.34 (9.20)     17.59 (0.98)        0.438 (0.023)
P-SVM     5.88                 9.42 (4.96)     10.28 (0.47)        0.424 (0.017)
Table 4: Regression results for the UCI data sets. The table shows the mean squared error and its standard deviation in brackets. Best results for each data set are shown in bold. For the “robot arm” data only one data set was available and, therefore, no standard deviation is given.
simultaneously leads to a trade-off between a small number of features (a small number of measurements) and a better classification result. Reducing the number of features is beneficial if measurements are costly and if a small increase in prediction error can be tolerated. Table 5 shows the results for the "protein" data set for various values of the regularization parameter ε. C was set to 100, because it gave good results for a wide range of ε values. We chose a minimal ε = 0.2 because it resulted in a classifier where all complex features were support vectors. The size of the training set is 203. Note that C was smaller than in the experiments in Section 3.1.2, because large values of ε push the dual variables towards zero, so that large values of C then have no influence. The table shows that classification performance drops if fewer features are considered, but that 5 % of the features suffice to obtain a performance of only about 5 % misclassifications, compared to about 2 % at the optimum. Since every feature value has to be determined via a sequence alignment, this saving in computation time is essential for large data bases like the Swiss-Prot data base (130,000 proteins), where supplying all pairwise relations is currently impossible.
protein data
  ε       H-α          H-β          M           GH
  0.2     1.3 (203)    4.9 (203)    0.9 (203)   1.3 (203)
  1       2.6 (41)     5.3 (110)    1.3 (28)    4.4 (41)
  10      3.5 (10)     8.8 (26)     1.8 (5)     13.3 (7)
  20      3.5 (5)      8.4 (12)     4.0 (4)     13.3 (5)
Table 5: Percentage of misclassification and the number of support features (in brackets) for the "protein" data set for the P-SVM method. The total number of features is 226. C was 100. The five columns (left → right) show the values for ε and the results for the four classification problems "one class against the rest" using 10-fold cross validation.
Table 6 shows the corresponding results (10–fold cross validation) for the
P-SVM applied to the "world wide web" data set "Cornell" and for the classification problem "student pages vs. the rest". Only ingoing links (matrix K^T of Section 3.1.3) were used. C was optimized using 3-fold cross validation on the corresponding training sets for each of the 10-fold cross validation runs. By increasing the regularization parameter ε, the number of web pages which have to be considered in order to classify a new page (the number of support vectors) decreases from 135 to 8. At the same time, the percentage of pages which can no longer be classified, because they receive no ingoing link from one of the "support vector pages", increases. The percentage of misclassification, however, is reduced from 14 % for ε = 0.1 to 0.6 % for ε = 2.0. With only 8 pages providing ingoing links, more than 50 % of the pages could be classified with only a 0.6 % misclassification rate.
"Cornell" data set, student pages
  ε     % cl.   % mis.   # (%) SVs         ε     % cl.   % mis.   # (%) SVs
  0.1   84      14       135 (38.6)        0.8   65      3.1      34 (9.7)
  0.2   81      12       115 (32.8)        0.9   64      2.7      32 (9.1)
  0.3   79      9.7      99 (28.3)         1.0   61      1.4      27 (7.7)
  0.4   75      6.9      72 (20.6)         1.1   59      1.0      21 (6.0)
  0.5   73      5.5      58 (16.6)         1.4   56      1.0      12 (3.4)
  0.6   71      4.8      48 (13.7)         1.6   55      1.0      10 (2.8)
  0.7   66      3.9      38 (10.9)         2.0   51      0.6      8 (2.3)
Table 6: P-SVM feature selection and classification results (10-fold cross validation) for the "world wide web" data set "Cornell" and the classification problem "student pages against the rest". The columns show (left → right): the value of ε, the percentage of classified pages ("cl."), the percentage of misclassifications ("mis."), and the number (and percentage) of support vectors. C was obtained by a 3-fold cross validation on the corresponding training sets.
3.3.2 DNA Microarray Data Sets

In this subsection we describe the application of the P-SVM to DNA microarray data. The data was taken from Pomeroy et al. (2002) (brain tumor), Shipp et al. (2002) (lymphoma tumor), and van't Veer et al. (2002) (breast cancer). All three data sets consist of tissue samples from different patients which were characterized by the expression values of a large number of genes. All samples were labeled according to the outcome of a particular treatment (positive/negative), and the task was to predict the outcome of the treatment for a new patient. We compared the classification performance of the following combinations of a feature selection and a classification method:
     feature selection                       classification
(1)  expression value of the TrkC gene      one gene classification
(2)  SPLASH (Califano et al., 1999)         likelihood ratio classifier (LRC)
(3)  signal-to-noise statistics (STN)       K-nearest neighbor (KNN)
(4)  signal-to-noise statistics (STN)       weighted voting (voting)
(5)  Fisher statistics (Fisher)             weighted voting (voting)
(6)  R2W2                                   R2W2
(7)  P-SVM                                  ν-SVM
The P-SVM/ν-SVM results are taken from (Hochreiter and Obermayer, 2004b), where the details concerning the data sets and the gene selection procedure based on the P-SVM can be found; the results for the other combinations were taken from the references cited above. All results are summarized in Table 7. The comparison shows that the P-SVM method outperforms the standard methods.
Brain Tumor
Feature Selection / Classification     # F    # E
TrkC gene                              1      33
SPLASH / LRC                           —      25
R2W2                                   —      25
STN / voting                           —      23
STN / KNN                              8      22
TrkC & SVM & KNN                       —      20
P-SVM / ν-SVM                          45     7

Lymphoma
Feature Selection / Classification     # F    # E
STN / KNN                              8      28
STN / voting                           13     24
R2W2                                   —      22
P-SVM / ν-SVM                          18     21

Breast Cancer
Feature Selection / Classification     # F    # E    ROC area
Fisher / voting                        70     26     0.88
P-SVM / ν-SVM                          30     15     0.77
Table 7: Results for the three DNA microarray data sets "brain tumor", "lymphoma", and "breast cancer" (see text). The table shows the leave-one-out classification error E (% misclassifications) and the number F of selected features. For breast cancer only the minimal value of E for different thresholds was available; therefore the area under a receiver operating curve is provided in addition. Best results are shown in bold.
4 Summary
In this contribution we have described the Potential Support Vector Machine (P-SVM) as a new method for classification, regression, and feature selection. The P-SVM selects models using the principle of structural risk minimization. In contrast to standard SVM approaches, however, the P-SVM is based on a new objective function and a new set of constraints which lead to an expansion of the classification or regression function in terms of "support features". The optimization problem is quadratic, always well defined, suited for dyadic data, and requires neither square nor positive definite Gram matrices. Therefore, the method can also be used without preprocessing with matrices which are measured and with matrices which are constructed from a vectorial representation using an indefinite kernel function. In feature selection mode the P-SVM allows features to be selected and ranked through the support vector weights of its sparse set of support vectors. The sparseness constraint avoids sets of features which are redundant. In a classification or regression setting this is an advantage over statistical methods, where redundant features are often kept as long as they provide information about the objects' attributes. Because the dual formulation of the optimization problem can be solved by a fast SMO technique, the new P-SVM can be applied to data sets with many features. The P-SVM approach was compared with several state-of-the-art classification, regression, and feature selection methods. Whenever significance tests could be applied, the P-SVM never performed significantly worse than its "competitors"; in many cases it performed significantly better. But even if no significant improvement in prediction error could be found, the P-SVM needed fewer "support features", i.e. fewer measurements, for evaluating new data objects. Finally, we have suggested a new interpretation of dyadic data, where objects in the real world are not described by vectors.
Structures like dot products are induced directly through measurements of object pairs, i.e. through relations between objects. This opens up a new field of research where relations between real world objects determine mathematical structures.
Acknowledgments

We thank M. Albery-Speyer, C. Büscher, C. Minoux, R. Sanyal, and S. Seo for their help with the numerical simulations. This work was funded by the Anna-Geissler- and the Monika-Kuntzner-Stiftung.
A Measurements, Kernels, and Dot Products
In this appendix we address the question under what conditions a “measurement kernel” which gives rise to a measured matrix $\boldsymbol{K}$ can be interpreted as a dot product between feature vectors describing the “row” and “column” objects of a “dyadic data” set. Assume that “column” objects $x$ (“samples”) and “row” objects $z$ (“complex features”) are from sets $\mathcal{X}$ and $\mathcal{Z}$, which can both be
completed by a $\sigma$-algebra and a measure to measurable spaces. Let $(\mathcal{U}, \mathcal{B}, \mu)$ be a measurable space with $\sigma$-algebra $\mathcal{B}$ and a $\sigma$-additive measure $\mu$ on the set $\mathcal{U}$. We consider functions $f : \mathcal{U} \to \mathbb{R}$ on the set $\mathcal{U}$. A function $f$ is called $\mu$-measurable on $(\mathcal{U}, \mathcal{B})$ if $f^{-1}([a, b]) \in \mathcal{B}$ for all $a, b \in \mathbb{R}$, and $\mu$-integrable if $\int_{\mathcal{U}} |f| \, d\mu < \infty$. We define
$$ \|f\|_{L^2} \;:=\; \left( \int_{\mathcal{U}} f^2 \, d\mu \right)^{\frac{1}{2}} \qquad (36) $$
and the set
$$ L^2(\mathcal{U}) \;:=\; \left\{ f : \mathcal{U} \to \mathbb{R} \,;\ f \text{ is } \mu\text{-measurable and } \|f\|_{L^2} < \infty \right\} . \qquad (37) $$
$L^2(\mathcal{U})$ is a Banach space with norm $\|\cdot\|_{L^2}$. If we define the dot product
$$ \langle f, g \rangle_{L^2(\mathcal{U})} \;:=\; \int_{\mathcal{U}} f \, g \, d\mu \qquad (38) $$
then the Banach space $L^2(\mathcal{U})$ is a Hilbert space with dot product $\langle \cdot, \cdot \rangle_{L^2(\mathcal{U})}$ and scalar field $\mathbb{R}$. For simplicity, we denote this Hilbert space by $L^2(\mathcal{U})$. $L^2(\mathcal{U}_1, \mathcal{U}_2)$ is the Hilbert space of functions $k$ with $\int \int k^2(u_1, u_2) \, d\mu(u_2) \, d\mu(u_1) < \infty$, using the product measure $\mu(\mathcal{U}_1 \times \mathcal{U}_2) = \mu(\mathcal{U}_1) \, \mu(\mathcal{U}_2)$. With these definitions we see that $H_1 := L^2(\mathcal{Z})$, $H_2 := L^2(\mathcal{X})$, and $H_3 := L^2(\mathcal{X}, \mathcal{Z})$ are Hilbert spaces of $L^2$-functions with domains $\mathcal{Z}$, $\mathcal{X}$, and $\mathcal{X} \times \mathcal{Z}$, respectively. The dot product in $H_i$ is denoted by $\langle \cdot, \cdot \rangle_{H_i}$. Note that for discrete $\mathcal{X}$ or $\mathcal{Z}$ the respective integrals can be replaced by sums (using a measure of Dirac delta functions at the discrete points, see Werner, 2000, p. 464, ex. (c)). Let us now assume that $k \in H_3$. $k$ induces a Hilbert-Schmidt operator $T_k$:
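For discrete sets the $L^2$ dot product of eq. (38) reduces to an ordinary finite sum, so functions on $\mathcal{U}$ become plain vectors. The following sketch (hypothetical numbers, numpy used purely for illustration) makes this concrete:

```python
import numpy as np

# Hypothetical discrete set U = {u_1, ..., u_5} with a counting (Dirac)
# measure: functions f, g on U are just the vectors of their values, and
# the dot product of eq. (38) becomes a finite sum.
f = np.array([1.0, -0.5, 2.0, 0.0, 1.5])
g = np.array([0.5, 1.0, -1.0, 3.0, 2.0])

dot_fg = np.sum(f * g)            # <f, g>_{L2(U)} as a sum over the points of U
norm_f = np.sqrt(np.sum(f ** 2))  # ||f||_{L2}, eq. (36), for the same measure

print(dot_fg, norm_f)
```

This is the sense in which, for finite row and column sets, the Hilbert space machinery below collapses to ordinary matrix algebra.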
$$ f(x) \;=\; (T_k \alpha)(x) \;=\; \int_{\mathcal{Z}} k(x, z) \, \alpha(z) \, d\mu(z) \,, \qquad (39) $$
which maps $\alpha \in H_1$ (a parameterization) to $f \in H_2$ (a classifier). If we set $\alpha(z) = \sum_{j=1}^{P} \alpha_j \, \delta_{z^j}$, we recover the P-SVM classification function (without $b$), eq. (27),
$$ f(u) \;=\; \sum_{j=1}^{P} \alpha_j \, k\!\left(u, z^j\right) \;=\; \sum_{j=1}^{P} \alpha_j \, \boldsymbol{K}(u)_j \,, \qquad (40) $$
where $\alpha_j = \alpha(z^j)$ and $\delta_{z^j}$ is the Dirac delta function at location $z^j$. The next theorem provides the conditions a kernel must fulfill in order to be interpretable as a dot product between the objects' feature vectors.

Theorem 1 (Singular Value Expansion) Let $\alpha$ be from $H_1$ and let $k$ be a kernel from $H_3$ which defines a Hilbert-Schmidt operator $T_k : H_1 \to H_2$
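The expansion of eq. (40) can be sketched in a few lines. The support features $z^j$, their weights $\alpha_j$, and the RBF measurement kernel below are all hypothetical stand-ins; the point is only that $f$ is evaluated as a sum over "row" (support feature) objects, not over training samples:

```python
import numpy as np

# Sketch of the P-SVM classification function of eq. (40), bias b omitted:
#   f(u) = sum_j alpha_j k(u, z^j).
rng = np.random.default_rng(0)
Z = rng.normal(size=(3, 2))          # three hypothetical support features z^1..z^3
alpha = np.array([0.7, -1.2, 0.5])   # their (hypothetical) P-SVM weights

def k(u, z, width=1.0):
    """An example measurement kernel k(u, z); here an RBF, chosen for illustration."""
    return np.exp(-np.sum((u - z) ** 2) / (2.0 * width ** 2))

def f(u):
    """Classification function f(u) = sum_j alpha_j k(u, z^j)."""
    return sum(a * k(u, zj) for a, zj in zip(alpha, Z))

u = np.array([0.1, -0.3])
print(np.sign(f(u)))                 # predicted class label for the new object u
```

Note that nothing here requires the kernel to be positive definite or the matrix $\boldsymbol{K}$ to be square; only evaluations $k(u, z^j)$ against the support features are needed.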
$$ (T_k \alpha)(x) \;=\; f(x) \;=\; \int_{\mathcal{Z}} k(x, z) \, \alpha(z) \, dz \,. \qquad (41) $$
Then
$$ \|f\|_{H_2}^2 \;=\; \langle T_k^{*} T_k \alpha, \alpha \rangle_{H_1} \,, \qquad (42) $$
where $T_k^{*}$ is the adjoint operator of $T_k$, and there exists an expansion
$$ k(x, z) \;=\; \sum_n s_n \, e_n(z) \, g_n(x) \qquad (43) $$
which converges in the $L^2$-sense. The $s_n \geq 0$ are the singular values of $T_k$, and $e_n \in H_1$, $g_n \in H_2$ are the corresponding orthonormal functions. For $\mathcal{X} = \mathcal{Z}$ and a symmetric, positive definite kernel $k$, we obtain the eigenfunctions $e_n = g_n$ of $T_k$ with corresponding eigenvalues $s_n$.

Proof. From $f = T_k \alpha$ we obtain
$$ \|f\|_{H_2}^2 \;=\; \langle T_k \alpha, T_k \alpha \rangle_{H_2} \;=\; \langle T_k^{*} T_k \alpha, \alpha \rangle_{H_1} \,. \qquad (44) $$
The singular value expansion of $T_k$ is
$$ T_k \alpha \;=\; \sum_n s_n \, \langle \alpha, e_n \rangle_{H_1} \, g_n \qquad (45) $$
(see Werner, 2000, Theorem VI.3.6). The values $s_n$ are the singular values of $T_k$ for the orthonormal systems $\{e_n\}$ on $H_1$ and $\{g_n\}$ on $H_2$. We define
$$ r_{nm} \;:=\; \langle T_k e_n, g_m \rangle_{H_2} \;=\; \delta_{nm} \, s_m \,, \qquad (46) $$
where the last “=” results from eq. (45) for $\alpha = e_n$. The sum
$$ \sum_m r_{nm}^2 \;=\; \sum_m \left( \langle T_k e_n, g_m \rangle_{H_2} \right)^2 \;\leq\; \|T_k e_n\|_{H_2}^2 \;<\; \infty \qquad (47) $$
converges because of Bessel's inequality (the “$\leq$”-sign). Next we complete the orthonormal system (ONS) $\{e_n\}$ to an orthonormal basis (ONB) $\{\tilde{e}_l\}$ by adding an ONB of the kernel $\ker(T_k)$ of the operator $T_k$ to the ONS $\{e_n\}$. The function $\alpha \in H_1$ possesses a unique representation through this basis: $\alpha = \sum_l \langle \alpha, \tilde{e}_l \rangle_{H_1} \, \tilde{e}_l$. We obtain
$$ T_k \alpha \;=\; \sum_l \langle \alpha, \tilde{e}_l \rangle_{H_1} \, T_k \tilde{e}_l \,, \qquad (48) $$
where we used that $T_k$ is continuous. Because $T_k \tilde{e}_l = 0$ for all $\tilde{e}_l \in \ker(T_k)$, the image $T_k \alpha$ can be expressed through the ONS $\{e_n\}$:
$$ T_k \alpha \;=\; \sum_n \langle \alpha, e_n \rangle_{H_1} \, T_k e_n \qquad (49) $$
$$ \;=\; \sum_n \langle \alpha, e_n \rangle_{H_1} \left( \sum_m \langle T_k e_n, g_m \rangle_{H_2} \, g_m \right) \;=\; \sum_{n,m} r_{nm} \, \langle \alpha, e_n \rangle_{H_1} \, g_m \,. $$
Here we used the fact that $\{g_m\}$ is an ONB of the range of $T_k$ and, therefore, $T_k e_n = \sum_m \langle T_k e_n, g_m \rangle_{H_2} \, g_m$. Because the set of functions $\{e_n g_m\}$ is an ONS in $H_3$ (which can be completed to an ONB) and $\sum_{n,m} r_{nm}^2 < \infty$ (cf. eq. (47)), the kernel
$$ \tilde{k}(z, x) \;:=\; \sum_{n,m} r_{nm} \, e_n(z) \, g_m(x) \qquad (50) $$
is from $H_3$. We observe that the induced Hilbert-Schmidt operator $T_{\tilde{k}}$ is equal to $T_k$:
$$ \left( T_{\tilde{k}} \alpha \right)(x) \;=\; \sum_{n,m} r_{nm} \, \langle \alpha, e_n \rangle_{H_1} \, g_m(x) \;=\; (T_k \alpha)(x) \,, \qquad (51) $$
where the first “=”-sign follows from eq. (50) and the second “=”-sign from eq. (49). Hence, the kernel $k$ and the kernel $\tilde{k}$ are equal except on a set of measure zero, i.e. $k = \tilde{k}$. We obtain $\langle T_k e_l, g_t \rangle_{H_2} = \delta_{lt} \, s_l$ from eq. (46) and $\langle T_k e_l, g_t \rangle_{H_2} = r_{lt}$ from eq. (51), and, therefore, $r_{lt} = \delta_{lt} \, s_l$. Inserting $r_{nm} = \delta_{nm} s_n$ into eq. (50) proves eq. (43) of the theorem. The last statement of the theorem follows from the fact that $|T_k| = (T_k^{*} T_k)^{1/2} = T_k$ ($T_k$ is positive and selfadjoint) and, therefore, $e_n = g_n$ (Werner, 2000, proof of Theorem VI.3.6, p. 246 top). ∎

As a consequence of this theorem, we can for finite $\mathcal{Z}$ define a mapping $\omega$ of “row” objects $z$ and a mapping of “column” objects $x$ into a common feature space where $k$ is a dot product.
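For finite row and column sets the theorem reduces to the singular value decomposition of the measurement matrix: eq. (43) becomes $\boldsymbol{K} = \sum_n s_n \, g_n e_n^{\top}$, and eq. (42) becomes $\|\boldsymbol{K}\alpha\|^2 = \alpha^{\top} \boldsymbol{K}^{\top} \boldsymbol{K} \alpha$. The following numerical sketch (a hypothetical random matrix, numpy used for illustration) checks both identities, including the symmetric positive semi-definite case where singular values and eigenvalues coincide:

```python
import numpy as np

# Discrete analogue of Theorem 1: for a finite measurement matrix K
# (rows = "column" objects x, columns = "row" objects z), the SVD
# K = U diag(s) V^T is the expansion of eq. (43), with the columns of
# U and V playing the roles of g_n and e_n.
rng = np.random.default_rng(1)
K = rng.normal(size=(4, 6))               # neither square nor positive definite

U, s, Vt = np.linalg.svd(K, full_matrices=False)
K_rebuilt = sum(s[n] * np.outer(U[:, n], Vt[n, :]) for n in range(len(s)))
assert np.allclose(K, K_rebuilt)          # eq. (43) holds exactly here

# Eq. (42): ||K alpha||^2 = <K^T K alpha, alpha>.
alpha = rng.normal(size=6)
assert np.isclose((K @ alpha) @ (K @ alpha), alpha @ (K.T @ K) @ alpha)

# Symmetric positive semi-definite case (X = Z): the singular values of
# K K^T equal its eigenvalues, mirroring e_n = g_n in the theorem.
A = K @ K.T
w = np.linalg.eigvalsh(A)                 # ascending eigenvalues
s2 = np.linalg.svd(A, compute_uv=False)   # descending singular values
assert np.allclose(np.sort(w)[::-1], s2)
```

In this finite setting the mappings into the common feature space are simply $\omega(z^j) = (\sqrt{s_n}\, e_n(z^j))_n$ and the analogous scaled $g_n$-coordinates for $x$, so that their dot product recovers $k(x, z^j)$.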