Marginalized Kernels Between Labeled Graphs
Hisashi Kashima [email protected]
IBM Tokyo Research Laboratory, 1623-14 Shimotsuruma, Yamato-shi, 242-8502 Kanagawa, Japan

Koji Tsuda [email protected]
Max Planck Institute for Biological Cybernetics, Spemannstr. 38, 72076 Tübingen, Germany; and AIST Computational Biology Research Center, 2-43 Aomi, Koto-ku, 135-0064 Tokyo, Japan

Akihiro Inokuchi [email protected]
IBM Tokyo Research Laboratory, 1623-14 Shimotsuruma, Yamato-shi, 242-8502 Kanagawa, Japan

Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, 2003.

Abstract

A new kernel function between two labeled graphs is presented. Feature vectors are defined as the counts of label paths produced by random walks on graphs. The kernel computation finally boils down to obtaining the stationary state of a discrete-time linear system, and thus is performed efficiently by solving simultaneous linear equations. Our kernel is based on an infinite-dimensional feature space, so it is fundamentally different from other string or tree kernels based on dynamic programming. We will present promising empirical results in classification of chemical compounds.¹

¹ An extended abstract of this research was presented at the IEEE ICDM International Workshop on Active Mining at Maebashi, Japan (Kashima & Inokuchi, 2002). This full paper contains much richer theoretical analyses, such as the interpretation as marginalized kernels, convergence conditions, the relationship to linear systems, and so on.

1. Introduction

A large amount of the research in machine learning is concerned with classification and regression for real-valued vectors (Vapnik, 1998). However, much of the real-world data is represented not as vectors, but as graphs, including sequences and trees, for example, biological sequences (Durbin et al., 1998), natural language texts (Manning & Schütze, 1999), semi-structured data such as HTML and XML (Abiteboul et al., 2000), and so on. Especially in the pharmaceutical area, chemical compounds are represented as labeled graphs, and their automatic classification is of crucial importance in the rationalization of drug discovery processes, to predict the effectiveness or toxicity of drugs from their chemical structures (Kramer & De Raedt, 2001; Inokuchi et al., 2000).

Kernel methods such as support vector machines are becoming increasingly popular for their high performance (Schölkopf & Smola, 2002). In kernel methods, all computations are done via a kernel function, which is the inner product of two vectors in a feature space. In order to apply kernel methods to graph classification, we first need to define a kernel function between the graphs. However, defining a kernel function is not an easy task, because it must be designed to be positive semidefinite.² Following the pioneering work by Haussler (1999), a number of kernels were proposed for structured data, for example, Watkins (2000), Jaakkola et al. (2000), Leslie et al. (2003), Lodhi et al. (2002), and Tsuda et al. (2002) for sequences, and Vishwanathan and Smola (2003), Collins and Duffy (2001), and Kashima and Koyanagi (2002) for trees. Most of them are based on the idea that an object is decomposed into substructures (i.e. subsequences, subtrees, or subgraphs) and a feature vector is composed of the counts of the substructures. As the dimensionality of such feature vectors is typically very high, these kernels deliberately avoid explicit computation of feature values, and adopt efficient procedures such as dynamic programming or suffix trees.

² Ad hoc similarity functions are not always positive semidefinite, e.g. Shimodaira et al. (2002) and Bahlmann et al. (2002).

In this paper, we will construct a kernel function between two graphs, which is distinguished from a kernel between two vertices in a graph, e.g., diffusion kernels (Kondor & Lafferty, 2002; Kandola et al., 2003; Lafferty & Lebanon, 2003), or a kernel between two paths in a graph, e.g., path kernels (Takimoto & Warmuth, 2002). There has been almost no significant work on designing kernels between two graphs, except for the work by Gärtner (2002). We discuss his kernels in the Appendix.

One existing way to describe a labeled graph as a feature vector is to count the label paths appearing in the graph. For example, the pattern discovery algorithm (Kramer & De Raedt, 2001; Inokuchi et al., 2000) explicitly constructs a feature vector from the counts of frequently appearing label paths in a graph. For the labeled graph shown in Figure 1, a label path is produced by traversing the vertices, and looks like (A, e, A, d, D, a, B, c, D), where the vertex labels A, B, C, D and the edge labels a, b, c, d, e appear alternately. Numerous label paths are produced by traversing the vertices in every possible way. Especially when the graph has a loop, the dimensionality of the count vector is infinite, because traversal may never end. In order to construct the feature vector explicitly, we have to select features to keep the dimensionality finite. The label paths may be selected simply by limiting the path length or, more intelligently, by the pattern discovery algorithm, which identifies the label paths that appear frequently in the training set of graphs. In either case, there is an additional unwanted parameter (i.e. a threshold on path length or path frequency) that has to be determined by a model selection criterion (e.g. the cross-validation error).
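To make this explicit feature construction concrete, the following minimal Python sketch enumerates and counts all label paths with at most a fixed number of vertices in a small labeled directed graph. The toy graph, its labels, and the length threshold are illustrative assumptions, not taken from the paper; the sketch only shows why a cutoff parameter becomes necessary once a loop makes the full set of label paths infinite.

```python
from collections import Counter

# A toy labeled directed graph (illustrative; not the graph of Figure 1).
# vertex_labels[i] is the label of vertex i; edges[(i, j)] is the label of edge i -> j.
vertex_labels = {1: "A", 2: "B", 3: "D"}
edges = {(1, 2): "a", (2, 3): "c", (3, 1): "d", (1, 1): "e"}  # (1, 1) is a self-loop

def count_label_paths(vertex_labels, edges, max_len):
    """Count label paths with at most `max_len` vertices (explicit feature construction)."""
    counts = Counter()
    # Start a walk at every vertex; each frontier item is (vertex sequence, label path).
    frontier = [((v,), (vertex_labels[v],)) for v in vertex_labels]
    while frontier:
        new_frontier = []
        for vertices, labels in frontier:
            counts[labels] += 1                      # one occurrence of this label path
            if len(vertices) == max_len:
                continue                             # the length threshold keeps this finite
            last = vertices[-1]
            for (i, j), e in edges.items():
                if i == last:
                    new_frontier.append((vertices + (j,),
                                         labels + (e, vertex_labels[j])))
        frontier = new_frontier
    return counts

features = count_label_paths(vertex_labels, edges, max_len=3)
print(len(features), "distinct label paths with <= 3 vertices")
print(features[("A", "e", "A")])   # the self-loop alone already yields arbitrarily long paths
```

Raising `max_len` makes the feature dictionary grow without bound because of the self-loop, which is exactly the unwanted threshold parameter discussed above.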
We will propose a kernel function based on the inner product of infinite-dimensional path count vectors. A label path is produced by random walks on graphs, and thus is regarded as a random variable. Our kernel is defined as the inner product of the count vectors averaged over all possible label paths, which is regarded as a special case of marginalized kernels (Tsuda et al., 2002). The kernel computation boils down to finding the stationary state of a discrete-time linear system (Rugh, 1995), which can be done efficiently by solving simultaneous linear equations with a sparse coefficient matrix. We especially note that this computational trick is fundamentally different from the dynamic programming techniques adopted in other kernels (e.g. Watkins (2000), Lodhi et al. (2002), Collins and Duffy (2001), Kashima and Koyanagi (2002)). Those kernels deal with very large but still finite-dimensional feature spaces and have parameters to constrain the dimensionality (e.g. the maximum length of subsequences in Lodhi et al. (2002)). In order to investigate how well our kernel performs on real data, we will show promising results on predicting the properties of chemical compounds.

This paper is organized as follows. In Section 2, we review marginalized kernels as the theoretical foundation. Then, a new kernel between graphs is presented in Section 3. In Section 4, we summarize the results of our experiments on the classification of chemical compounds. Finally, we conclude in Section 5.

2. Marginalized Kernels

A common way of constructing a kernel for structured data such as strings, trees, and graphs is to assume hidden variables and make use of the probability distribution of visible and hidden variables (Haussler, 1999; Watkins, 2000; Tsuda et al., 2002). For example, Watkins (2000) proposed conditional symmetric independence (CSI) kernels, which are described as

K(x, x') = \sum_{h} p(x \mid h) \, p(x' \mid h) \, p(h),

where h is a hidden variable, and x and x' are visible variables which correspond to structured data. This kernel requires p(x|h), which means that the generation process of x from h needs to be known. However, it may be the case that p(h|x) is known instead of p(x|h). Then we can use a marginalized kernel (Tsuda et al., 2002), which is described as

K(x, x') = \sum_{h} \sum_{h'} K_z(z, z') \, p(h \mid x) \, p(h' \mid x'),    (1)

where z = [x, h] and K_z(z, z') is the joint kernel depending on both the visible and hidden variables. The posterior probability p(h|x) can be interpreted as a feature extractor that extracts informative features for classification from x. The marginalized kernel (1) is defined as the expectation of the joint kernel over all possible values of h and h'. In the following, we construct a graph kernel in the context of the marginalized kernel. The hidden variable is a sequence of vertex indices, which is generated by random walks on the graph, and the joint kernel K_z is defined as a kernel between the sequences of vertex and edge labels traversed in the random walk.
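As a sanity check on Eq. (1), the following small Python sketch computes a marginalized kernel by brute force when the hidden-variable space is finite. The posterior tables and the joint kernel here are made-up placeholders, not the choices the paper makes for graphs in Section 3; the point is only that the kernel is the expectation of the joint kernel under p(h|x) p(h'|x').

```python
# A minimal sketch of Eq. (1): K(x, x') = sum_{h, h'} K_z(z, z') p(h|x) p(h'|x').
# The posteriors and the joint kernel below are illustrative assumptions,
# not the definitions used later in the paper for labeled graphs.

def marginalized_kernel(posterior_x, posterior_xp, joint_kernel):
    """Brute-force expectation of the joint kernel over hidden variables h, h'."""
    total = 0.0
    for h, p_h in posterior_x.items():          # p(h | x)
        for hp, p_hp in posterior_xp.items():   # p(h' | x')
            total += joint_kernel(h, hp) * p_h * p_hp
    return total

# Hidden variables are short label sequences; the joint kernel is 1 iff they match.
posterior_x = {("A", "B"): 0.7, ("A", "D"): 0.3}      # p(h | x)
posterior_xp = {("A", "B"): 0.4, ("B", "D"): 0.6}     # p(h' | x')
delta_kernel = lambda h, hp: 1.0 if h == hp else 0.0  # K_z(z, z')

print(marginalized_kernel(posterior_x, posterior_xp, delta_kernel))  # 0.7 * 0.4 = 0.28
```

When the hidden variable ranges over all random walks on a graph, this double sum is no longer tractable by enumeration, which is why the paper instead reduces the computation to a system of linear equations.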
3. Graph Kernel

In this section, we introduce a new kernel between graphs with vertex labels and edge labels. At the beginning, let us formally define a labeled graph. Denote by G a labeled directed graph and by |G| the number of its vertices. Each vertex of the graph is uniquely indexed from 1 to |G|. Let v_i ∈ Σ_V denote the label of vertex i and e_{ij} ∈ Σ_E denote the label of the edge from i to j. Figure 1 shows an example of the graphs that we handle in this paper. We assume that there are no multiple edges between any pair of vertices. Our task is to construct a kernel function K(G, G') between two labeled graphs G and G'.

3.1. Random Walks on Graphs

The hidden variable h = (h_1, ..., h_ℓ) associated with graph G is a sequence of natural numbers from 1 to |G|. Given a graph G, h is generated by a random walk as follows: at the first step, h_1 is sampled from an initial probability distribution over the vertices.
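To illustrate the setup of Section 3.1, here is a minimal Python sketch of a labeled directed graph in the notation above, together with sampling the hidden vertex sequence h by a random walk and reading off the corresponding label path. The uniform initial distribution, uniform transitions over outgoing edges, and the fixed walk length are placeholder assumptions for illustration only; the excerpt above does not fix the paper's actual initial, transition, or termination probabilities.

```python
import random

# A small labeled directed graph (illustrative labels).
# vertex_labels[i] = v_i in Sigma_V; edge_labels[(i, j)] = e_ij in Sigma_E.
vertex_labels = {1: "A", 2: "B", 3: "D", 4: "C"}
edge_labels = {(1, 2): "a", (2, 3): "c", (3, 4): "d", (4, 1): "b", (1, 3): "e"}

def sample_label_path(vertex_labels, edge_labels, length, rng=random):
    """Sample h = (h_1, ..., h_l) by a random walk and return (h, its label path).

    Assumptions (placeholders, not the paper's definitions): h_1 is uniform over
    the vertices, each step moves uniformly over outgoing edges, and the walk has
    a fixed length instead of a random termination.
    """
    out_edges = {}
    for (i, j) in edge_labels:
        out_edges.setdefault(i, []).append(j)

    h = [rng.choice(list(vertex_labels))]          # h_1
    labels = [vertex_labels[h[0]]]                 # label path starts with v_{h_1}
    while len(h) < length:
        current = h[-1]
        if current not in out_edges:               # dead end: stop the walk early
            break
        nxt = rng.choice(out_edges[current])       # h_{k+1} given h_k
        labels += [edge_labels[(current, nxt)], vertex_labels[nxt]]
        h.append(nxt)
    return h, tuple(labels)

h, label_path = sample_label_path(vertex_labels, edge_labels, length=4)
print(h, label_path)   # e.g. [2, 3, 4, 1] ('B', 'c', 'D', 'd', 'C', 'b', 'A')
```

Each sampled h is one hidden variable in the sense of Eq. (1), and its label path is the object on which the joint kernel K_z is evaluated.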