
Marginalized Kernels Between Labeled Graphs

Hisashi Kashima [email protected]
IBM Tokyo Research Laboratory, 1623-14 Shimotsuruma, Yamato-shi, 242-8502 Kanagawa, Japan

Koji Tsuda [email protected]
Max Planck Institute for Biological Cybernetics, Spemannstr. 38, 72076 Tübingen, Germany; and AIST Computational Biology Research Center, 2-43 Aomi, Koto-ku, 135-0064 Tokyo, Japan

Akihiro Inokuchi [email protected]
IBM Tokyo Research Laboratory, 1623-14 Shimotsuruma, Yamato-shi, 242-8502 Kanagawa, Japan

Abstract

A kernel function between two labeled graphs is presented. Feature vectors are defined as the counts of label paths produced by random walks on graphs. The kernel computation finally boils down to obtaining the stationary state of a discrete-time linear system, and thus is efficiently performed by solving simultaneous linear equations. Our kernel is based on an infinite dimensional feature space, so it is fundamentally different from other string or tree kernels based on dynamic programming. We present promising empirical results in the classification of chemical compounds.¹

1. Introduction

A large amount of the research in machine learning is concerned with classification and regression for real-valued vectors (Vapnik, 1998). However, much real-world data is represented not as vectors, but as graphs, including sequences and trees, for example, biological sequences (Durbin et al., 1998), natural language texts (Manning & Schütze, 1999), semi-structured data such as HTML and XML (Abiteboul et al., 2000), and so on. Especially in the pharmaceutical area, chemical compounds are represented as labeled graphs, and their automatic classification is of crucial importance in the rationalization of drug discovery processes, e.g., to predict the effectiveness or toxicity of drugs from their chemical structures (Kramer & De Raedt, 2001; Inokuchi et al., 2000).

Kernel methods such as support vector machines are becoming increasingly popular for their high performance (Schölkopf & Smola, 2002). In kernel methods, all computations are done via a kernel function, which is the inner product of two vectors in a feature space. In order to apply kernel methods to graph classification, we first need to define a kernel function between the graphs. However, defining a kernel function is not an easy task, because it must be designed to be positive semidefinite². Following the pioneering work by Haussler (1999), a number of kernels were proposed for structured data, for example, Watkins (2000), Jaakkola et al. (2000), Leslie et al. (2003), Lodhi et al. (2002), and Tsuda et al. (2002) for sequences, and Vishwanathan and Smola (2003), Collins and Duffy (2001), and Kashima and Koyanagi (2002) for trees. Most of them are based on the idea that an object is decomposed into substructures (i.e., subsequences, subtrees, or subgraphs) and a feature vector is composed of the counts of the substructures. As the dimensionality of feature vectors is typically very high, they deliberately avoid explicit computation of feature values, and adopt efficient procedures such as dynamic programming or suffix trees.

¹An extended abstract of this research has been presented at the IEEE ICDM International Workshop on Active Mining at Maebashi, Japan (Kashima & Inokuchi, 2002). This full paper contains much richer theoretical analyses, such as the interpretation as marginalized kernels, convergence conditions, the relationship to linear systems, and so on.

²Ad hoc similarity functions are not always positive semidefinite, e.g., Shimodaira et al. (2002) and Bahlmann et al. (2002).

In this paper, we construct a kernel function between two graphs, which is distinguished from a kernel between two vertices in a graph, e.g., diffusion kernels (Kondor & Lafferty, 2002; Kandola et al., 2003; Lafferty & Lebanon, 2003), or a kernel between two paths in a graph, e.g., path kernels (Takimoto & Warmuth, 2002). There has been almost no significant work on designing kernels between two graphs except for the work by Gärtner (2002). We discuss his kernels in the Appendix.

One existing way to describe a labeled graph as a feature vector is to count the label paths appearing in the graph. For example, the pattern discovery algorithm (Kramer & De Raedt, 2001; Inokuchi et al., 2000) explicitly constructs a feature vector from the counts of frequently appearing label paths in a graph. For the labeled graph shown in Figure 1, a label path is produced by traversing the vertices, and looks like

(A, e, A, d, D, a, B, c, D),

where the vertex labels A, B, C, D and the edge labels a, b, c, d, e appear alternately. Numerous label paths are produced by traversing the nodes in every possible way. Especially when the graph has a loop, the dimensionality of the count vector is infinite, because traversing may never end. In order to explicitly construct the feature vector, we have to select features to keep the dimensionality finite. The label paths may be selected simply by limiting the path length or, more intelligently, the pattern discovery algorithm identifies the label paths which appear frequently in the training set of graphs. In any case, there is an additional unwanted parameter (i.e., a threshold on path length or path frequency) that has to be determined by a model selection criterion (e.g., the cross validation error).

We propose a kernel function based on the inner product between infinite dimensional path count vectors. A label path is produced by random walks on graphs, and thus is regarded as a random variable. Our kernel is defined as the inner product of the count vectors averaged over all possible label paths, which is regarded as a special case of marginalized kernels (Tsuda et al., 2002). The kernel computation boils down to finding the stationary state of a discrete-time linear system (Rugh, 1995), which can be done efficiently by solving simultaneous linear equations with a sparse coefficient matrix. We especially notice that this computational trick is fundamentally different from the dynamic programming techniques adopted in other kernels (e.g., Watkins (2000), Lodhi et al. (2002), Collins and Duffy (2001), Kashima and Koyanagi (2002)). These kernels deal with very large but still finite dimensional feature spaces and have parameters to constrain the dimensionality (e.g., the maximum length of subsequences in Lodhi et al. (2002)). In order to investigate how well our kernel performs on real data, we show promising results on predicting the properties of chemical compounds.

This paper is organized as follows. In Section 2, we review marginalized kernels as the theoretical foundation. Then, a new kernel between graphs is presented in Section 3. In Section 4, we summarize the results of our experiments on the classification of chemical compounds. Finally, we conclude with Section 5.

2. Marginalized Kernels

A common way of constructing a kernel for structured data such as strings, trees, and graphs is to assume hidden variables and make use of the probability distribution of visible and hidden variables (Haussler, 1999; Watkins, 2000; Tsuda et al., 2002). For example, Watkins (2000) proposed conditional symmetric independence (CSI) kernels, which are described as

K(x, x') = \sum_h p(x|h) p(x'|h) p(h),

where h is a hidden variable, and x and x' are visible variables which correspond to structured data. This kernel requires p(x|h), which means that the generation process of x from h needs to be known. However, it may be the case that p(h|x) is known instead of p(x|h). Then we can use a marginalized kernel (Tsuda et al., 2002), which is described as

K(x, x') = \sum_h \sum_{h'} K_z(z, z') p(h|x) p(h'|x'),   (1)

where z = [x, h] and K_z(z, z') is the joint kernel depending on both visible and hidden variables. The posterior probability p(h|x) can be interpreted as a feature extractor that extracts informative features for classification from x. The marginalized kernel (1) is defined as the expectation of the joint kernel over all possible values of h and h'. In the following, we construct a graph kernel in the context of marginalized kernels. The hidden variable is a sequence of vertex indices, which is generated by random walks on the graph, and the joint kernel K_z is defined as a kernel between the sequences of vertex and edge labels traversed in the random walk.

3. Graph Kernel

In this section, we introduce a new kernel between graphs with vertex labels and edge labels. At the beginning, let us formally define a labeled graph. Denote by G a labeled directed graph and by |G| the number of vertices. Each vertex of the graph is uniquely indexed from 1 to |G|. Let v_i \in \Sigma_V denote the label of vertex i and e_{ij} \in \Sigma_E denote the label of the edge from i to j. Figure 1 shows an example of the graphs that we handle in this paper. We assume that there are no multiple edges between any pair of vertices. Our task is to construct a kernel function K(G, G') between two labeled graphs G and G'.
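To make the objects concrete, the following minimal Python sketch shows one possible in-memory representation of such a labeled directed graph (a vertex-label list plus an edge-label dictionary with 0-based indices). The class and field names are our own illustration, not notation from the paper.

```python
# A minimal representation of a labeled directed graph as used in Section 3:
# the paper indexes vertices 1..|G|; here we use 0..|G|-1.  Each vertex i
# carries a label v[i], and each directed edge (i, j) carries a label e[(i, j)].
class LabeledGraph:
    def __init__(self, vertex_labels, edge_labels):
        self.v = list(vertex_labels)      # v[i]: label of vertex i
        self.e = dict(edge_labels)        # e[(i, j)]: label of edge i -> j
        self.size = len(self.v)           # |G|

    def successors(self, i):
        """Vertices reachable from vertex i by one directed edge."""
        return [j for (a, j) in self.e if a == i]

# A small toy instance in the spirit of Figure 1, with vertex labels from
# {A, B, C, D} and edge labels from {a, b, c, d, e}.
G = LabeledGraph(
    vertex_labels=["A", "A", "D", "B"],
    edge_labels={(0, 1): "e", (1, 2): "d", (2, 3): "a", (3, 2): "c"},
)
```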

3.1. Random Walks on Graphs

The hidden variable h = (h_1, ..., h_\ell) associated with graph G is a sequence of natural numbers from 1 to |G|. Given a graph G, h is generated by a random walk as follows. At the first step, h_1 is sampled from the initial probability distribution p_s(h_1). Subsequently, at the i-th step, the next vertex h_i is sampled subject to the transition probability p_t(h_i|h_{i-1}), but the random walk may also end with probability p_q(h_{i-1}):

\sum_{j=1}^{|G|} p_t(j|i) + p_q(i) = 1.   (2)

The posterior probability for the label path h is described as

p(h|G) = p_s(h_1) \prod_{i=2}^{\ell} p_t(h_i|h_{i-1}) \, p_q(h_\ell),

where \ell is the length of h.

When we do not have any prior knowledge, we can set p_s to be the uniform distribution, the transition probability p_t to be a uniform distribution over the vertices adjacent to the current vertex, and the termination probability p_q to be a constant. Gaps (Lodhi et al., 2002) can also be incorporated by setting the transition probabilities appropriately.
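Assuming the uniform choices just described, the distributions p_s, p_t, p_q and the posterior p(h|G) can be sketched as follows. The function names, the adjacency-dict format, and the toy graph are our own, and every vertex is assumed to have at least one outgoing edge.

```python
# A sketch of the uniform random-walk distributions of Section 3.1 and the
# resulting posterior p(h|G).  gamma is the constant termination probability;
# succ maps each vertex to its list of successors.
def walk_probabilities(succ, n, gamma):
    p_s = {i: 1.0 / n for i in range(n)}                         # uniform start
    p_q = {i: gamma for i in range(n)}                           # constant stop
    p_t = {i: {j: (1.0 - gamma) / len(succ[i]) for j in succ[i]}
           for i in range(n)}                                    # satisfies (2)
    return p_s, p_t, p_q

def path_probability(h, p_s, p_t, p_q):
    """p(h|G) = p_s(h_1) * prod_i p_t(h_i | h_{i-1}) * p_q(h_l)."""
    p = p_s[h[0]]
    for prev, cur in zip(h, h[1:]):
        p *= p_t[prev].get(cur, 0.0)
    return p * p_q[h[-1]]

succ = {0: [1], 1: [2], 2: [3], 3: [2]}     # toy graph from the earlier sketch
p_s, p_t, p_q = walk_probabilities(succ, 4, gamma=0.1)
print(path_probability([0, 1, 2], p_s, p_t, p_q))   # 0.25 * 0.9 * 0.9 * 0.1 = 0.02025
```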

Figure 1. An example of a graph with vertex labels and edge labels.

3.2. Joint Kernel

Next we define the joint kernel K_z, assuming that the hidden sequences h and h' are given for the two graphs G and G', respectively. When the random walk is performed as described by h, the traversed labels are listed as

v_{h_1} e_{h_1 h_2} v_{h_2} e_{h_2 h_3} v_{h_3} \cdots

Assume that two kernel functions, K(v, v') and K(e, e'), are readily defined between vertex labels and edge labels, respectively. We constrain both kernels to be nonnegative, K(v, v'), K(e, e') \ge 0.³ An example of a vertex label kernel is

K(v, v') = \delta(v = v'),   (3)

where \delta is a function that returns 1 if its argument holds, and 0 otherwise. If the labels are defined in \mathbb{R}, the Gaussian kernel

K(v, v') = \exp(-\|v - v'\|^2 / 2\sigma^2)   (4)

would be a natural choice (Schölkopf & Smola, 2002).

The joint kernel is defined as the product of the label kernels as follows:

K_z(z, z') =
  0,   if \ell \ne \ell',
  K(v_{h_1}, v'_{h'_1}) \prod_{i=2}^{\ell} K(e_{h_{i-1} h_i}, e'_{h'_{i-1} h'_i}) K(v_{h_i}, v'_{h'_i}),   if \ell = \ell',

where z = (G, h).

³This constraint will play an important role in proving the convergence of the marginalized kernel in Section 3.4.
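A direct transcription of this definition, using the delta kernel (3) for both vertex and edge labels, might look like the following sketch; the argument layout (label lists and edge-label dicts with 0-based indices) is our own convention.

```python
# A sketch of the joint kernel K_z of Section 3.2 for two walks h and h2:
# zero if the walks have different lengths, otherwise the product of vertex-
# and edge-label kernels along the walks.
def delta(a, b):
    return 1.0 if a == b else 0.0

def joint_kernel(v, e, h, v2, e2, h2, kv=delta, ke=delta):
    if len(h) != len(h2):                    # K_z vanishes for unequal lengths
        return 0.0
    k = kv(v[h[0]], v2[h2[0]])
    for i in range(1, len(h)):
        k *= ke(e[(h[i - 1], h[i])], e2[(h2[i - 1], h2[i])])
        k *= kv(v[h[i]], v2[h2[i]])
    return k

# Two tiny graphs and one walk on each.
v,  e  = ["A", "B", "D"], {(0, 1): "a", (1, 2): "c"}
v2, e2 = ["A", "B", "C"], {(0, 1): "a", (1, 2): "c"}
print(joint_kernel(v, e, [0, 1, 2], v2, e2, [0, 1, 2]))   # 0.0: labels D and C differ
```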

3.3. Efficient Computation

The marginalized kernel (1) is the expectation of K_z over all possible h and h', which is described as

K(G, G') = \sum_{\ell=1}^{\infty} \sum_{h} \sum_{h'} \Big( p_s(h_1) \prod_{i=2}^{\ell} p_t(h_i|h_{i-1}) \, p_q(h_\ell) \Big) \Big( p'_s(h'_1) \prod_{j=2}^{\ell} p'_t(h'_j|h'_{j-1}) \, p'_q(h'_\ell) \Big) \times K(v_{h_1}, v'_{h'_1}) \prod_{k=2}^{\ell} K(e_{h_{k-1} h_k}, e'_{h'_{k-1} h'_k}) K(v_{h_k}, v'_{h'_k}),

where \sum_h := \sum_{h_1=1}^{|G|} \cdots \sum_{h_\ell=1}^{|G|}. The straightforward enumeration is obviously impossible, because \ell spans from 1 to infinity. Nevertheless, we have an efficient way to compute this kernel, as shown below. By rearranging the terms, the kernel has the following nested structure:

K(G, G') = \lim_{L \to \infty} \sum_{\ell=1}^{L} \sum_{h_1, h'_1} s(h_1, h'_1) \Big[ \sum_{h_2, h'_2} t(h_2, h'_2, h_1, h'_1) \Big[ \sum_{h_3, h'_3} t(h_3, h'_3, h_2, h'_2) \times \cdots \Big[ \sum_{h_\ell, h'_\ell} t(h_\ell, h'_\ell, h_{\ell-1}, h'_{\ell-1}) q(h_\ell, h'_\ell) \Big] \cdots \Big] \Big],   (5)

where

s(h_1, h'_1) := p_s(h_1) p'_s(h'_1) K(v_{h_1}, v'_{h'_1}),

t(h_i, h'_i, h_{i-1}, h'_{i-1}) := p_t(h_i|h_{i-1}) p'_t(h'_i|h'_{i-1}) K(v_{h_i}, v'_{h'_i}) K(e_{h_{i-1} h_i}, e'_{h'_{i-1} h'_i}),

q(h_\ell, h'_\ell) := p_q(h_\ell) p'_q(h'_\ell).

Let us further simplify (5) as

K(G, G') = \lim_{L \to \infty} \sum_{\ell=1}^{L} \sum_{h_1, h'_1} s(h_1, h'_1) \, r_\ell(h_1, h'_1),

where r_1(h_1, h'_1) := q(h_1, h'_1) and, for \ell \ge 2,

r_\ell(h_1, h'_1) := \sum_{h_2, h'_2} t(h_2, h'_2, h_1, h'_1) \Big[ \sum_{h_3, h'_3} t(h_3, h'_3, h_2, h'_2) \times \cdots \Big[ \sum_{h_\ell, h'_\ell} t(h_\ell, h'_\ell, h_{\ell-1}, h'_{\ell-1}) q(h_\ell, h'_\ell) \Big] \cdots \Big].
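Before the sum is reordered below, it may help to see how s, t and q can be tabulated for two concrete graphs. The following sketch assumes the uniform random-walk distributions of Section 3.1 with a constant termination probability gamma and the delta kernel (3); the array names and shapes are our own choices, not notation fixed by the paper.

```python
import numpy as np

# Tabulate the quantities s, t, q defined after equation (5) for two small
# graphs given as vertex-label lists and edge-label dicts (0-based indices).
def delta(a, b):
    return 1.0 if a == b else 0.0

def walk_params(edges, n, gamma):
    succ = {i: [j for (a, j) in edges if a == i] for i in range(n)}
    ps = np.full(n, 1.0 / n)                 # uniform start
    pq = np.full(n, gamma)                   # constant termination
    pt = np.zeros((n, n))                    # pt[i, j] = p_t(j | i)
    for i in range(n):
        for j in succ[i]:
            pt[i, j] = (1.0 - gamma) / len(succ[i])
    return ps, pt, pq

def stq(v, e, v2, e2, gamma=0.1):
    n, m = len(v), len(v2)
    ps, pt, pq = walk_params(e, n, gamma)
    ps2, pt2, pq2 = walk_params(e2, m, gamma)
    s = np.array([[ps[i] * ps2[j] * delta(v[i], v2[j]) for j in range(m)]
                  for i in range(n)])
    q = np.outer(pq, pq2)
    # t[i, j, i0, j0] = t(h_i = i, h'_i = j, h_{i-1} = i0, h'_{i-1} = j0)
    t = np.zeros((n, m, n, m))
    for (i0, i), le in e.items():
        for (j0, j), le2 in e2.items():
            t[i, j, i0, j0] = (pt[i0, i] * pt2[j0, j]
                               * delta(v[i], v2[j]) * delta(le, le2))
    return s, t, q
```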

Replacing the order of summation, we have the following:

K(G, G') = \sum_{h_1, h'_1} s(h_1, h'_1) \lim_{L \to \infty} \sum_{\ell=1}^{L} r_\ell(h_1, h'_1)
         = \sum_{h_1, h'_1} s(h_1, h'_1) \lim_{L \to \infty} R_L(h_1, h'_1),   (6)

where

R_L(h_1, h'_1) := \sum_{\ell=1}^{L} r_\ell(h_1, h'_1).

Thus we need to compute R_\infty(h_1, h'_1) to obtain K(G, G').

Now let us restate this problem in terms of linear system theory (Rugh, 1995). The following recursive relationship holds between r_k and r_{k-1} (k \ge 2):

r_k(h_1, h'_1) = \sum_{i,j} t(i, j, h_1, h'_1) \, r_{k-1}(i, j).   (7)

Using (7), a recursive relationship for R_L also holds:

R_L(h_1, h'_1) = r_1(h_1, h'_1) + \sum_{k=2}^{L} r_k(h_1, h'_1)
             = r_1(h_1, h'_1) + \sum_{k=2}^{L} \sum_{i,j} t(i, j, h_1, h'_1) \, r_{k-1}(i, j)
             = r_1(h_1, h'_1) + \sum_{i,j} t(i, j, h_1, h'_1) \, R_{L-1}(i, j).   (8)

Thus R_L can be perceived as a discrete-time linear system (Rugh, 1995) evolving as the time L increases. Assuming that R_L converges (see Section 3.4 for the convergence condition), we have the following equilibrium equation:

R_\infty(h_1, h'_1) = r_1(h_1, h'_1) + \sum_{i,j} t(i, j, h_1, h'_1) \, R_\infty(i, j).   (9)

Therefore, the computation of the marginalized kernel finally comes down to solving the simultaneous linear equations (9) and substituting the solutions into (6).

Computing the kernel requires solving a linear equation with a |G||G'| \times |G||G'| coefficient matrix. However, the matrix is actually sparse, because the number of non-zero elements is less than c^2 |G||G'|, where c is the maximum degree. Therefore, we can employ various kinds of efficient numerical algorithms that exploit sparsity (Barrett et al., 1994). In our implementation, we employed a simple iterative method that updates the current solution by using (8) until convergence, starting from R_1(h_1, h'_1) = r_1(h_1, h'_1).
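The iterative scheme just described (and, as an alternative, a direct solve of the equilibrium equation (9)) can be sketched as follows, taking arrays s, t, q as input, e.g. as built in the previous sketch. The function name, tolerance and iteration cap are our own choices.

```python
import numpy as np

# Iterative computation of the graph kernel: start from R_1 = r_1 = q, repeat
# recursion (8) until convergence, then substitute the fixed point into (6).
# s and q have shape (|G|, |G'|); t has shape (|G|, |G'|, |G|, |G'|),
# indexed t[i, j, i0, j0] as in the previous sketch.
def graph_kernel(s, t, q, tol=1e-12, max_iter=1000):
    n, m = q.shape
    T = t.reshape(n * m, n * m)              # row = (i, j), column = (i0, j0)
    R = q.copy()                             # R_1 = r_1 = q
    for _ in range(max_iter):
        # (8): R_L(i0, j0) = q(i0, j0) + sum_{i,j} t(i, j, i0, j0) R_{L-1}(i, j)
        R_new = q + (T.T @ R.reshape(-1)).reshape(n, m)
        if np.max(np.abs(R_new - R)) < tol:
            break
        R = R_new
    # Alternatively, solve the equilibrium equation (9) directly:
    # R = np.linalg.solve(np.eye(n * m) - T.T, q.reshape(-1)).reshape(n, m)
    return float(np.sum(s * R))              # equation (6)

# Usage: s, t, q = stq(v, e, v2, e2, gamma=0.1)  # from the previous sketch
```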

3.4. Convergence Condition

The convergence condition needed to justify (9) is described as follows.

Theorem 1. The infinite sequence \lim_{L \to \infty} R_L(h_1, h'_1) converges for any h_1 \in \{1, ..., |G|\} and h'_1 \in \{1, ..., |G'|\}, if the following inequality holds for all i_0 \in \{1, ..., |G|\} and j_0 \in \{1, ..., |G'|\}:

\sum_{i=1}^{|G|} \sum_{j=1}^{|G'|} t(i, j, i_0, j_0) \, q(i, j) < q(i_0, j_0).   (10)

(Proof) R_\ell(h_1, h'_1) can also be described as

R_\ell(h_1, h'_1) = q(h_1, h'_1) + \sum_{i=2}^{\ell} \sum_{h_2, h'_2} t(h_2, h'_2, h_1, h'_1) \Big[ \sum_{h_3, h'_3} t(h_3, h'_3, h_2, h'_2) \times \cdots \Big[ \sum_{h_i, h'_i} t(h_i, h'_i, h_{i-1}, h'_{i-1}) q(h_i, h'_i) \Big] \cdots \Big]
               =: q(h_1, h'_1) + \sum_{i=2}^{\ell} u_i(h_1, h'_1),

where u_i (i \ge 2) denotes the i-th term of the series. According to the ratio test, this series converges if \limsup_{i \to \infty} |u_i(h_1, h'_1)| / |u_{i-1}(h_1, h'_1)| < 1. Since K(v, v') and K(e, e') are nonnegative as previously defined, u_i(h_1, h'_1) \ge 0, and thus what we need to prove is \limsup_{i \to \infty} u_i(h_1, h'_1) / u_{i-1}(h_1, h'_1) < 1. For this purpose, it is sufficient to show

u_i(h_1, h'_1) / u_{i-1}(h_1, h'_1) < 1   (11)

for all i. The two terms in (11) are written as

u_i(h_1, h'_1) = \sum_{h_{i-1}, h'_{i-1}} a_{i-1}(h_{i-1}, h'_{i-1}, h_1, h'_1) \sum_{h_i, h'_i} t(h_i, h'_i, h_{i-1}, h'_{i-1}) \, q(h_i, h'_i),   (12)

u_{i-1}(h_1, h'_1) = \sum_{h_{i-1}, h'_{i-1}} a_{i-1}(h_{i-1}, h'_{i-1}, h_1, h'_1) \, q(h_{i-1}, h'_{i-1}),   (13)

where

a_{i-1}(h_{i-1}, h'_{i-1}, h_1, h'_1) := \sum_{h_2, h'_2} t(h_2, h'_2, h_1, h'_1) \Big[ \sum_{h_3, h'_3} t(h_3, h'_3, h_2, h'_2) \times \cdots \Big[ \sum_{h_{i-2}, h'_{i-2}} t(h_{i-2}, h'_{i-2}, h_{i-3}, h'_{i-3}) \, t(h_{i-1}, h'_{i-1}, h_{i-2}, h'_{i-2}) \Big] \cdots \Big].

Notice that a_{i-1}(h_{i-1}, h'_{i-1}, h_1, h'_1) \ge 0 as well. Substituting (12) and (13) into (11), we have

\sum_{h_{i-1}, h'_{i-1}} a_{i-1}(h_{i-1}, h'_{i-1}, h_1, h'_1) \sum_{h_i, h'_i} t(h_i, h'_i, h_{i-1}, h'_{i-1}) \, q(h_i, h'_i) < \sum_{h_{i-1}, h'_{i-1}} a_{i-1}(h_{i-1}, h'_{i-1}, h_1, h'_1) \, q(h_{i-1}, h'_{i-1}).

Both sides of this inequality are linear combinations of a_{i-1}(h_{i-1}, h'_{i-1}, h_1, h'_1) with nonnegative coefficients. In order to prove the inequality, it is sufficient to prove the following inequalities for the coefficients:

\sum_{i,j} t(i, j, i_0, j_0) \, q(i, j) < q(i_0, j_0)  for all i_0, j_0,

which we have assumed to hold in (10).

The condition (10) seems rather complicated, but we can have a simpler condition if the termination probabilities are constant over all vertices.

Lemma 1. If p_q(i) = p'_q(j) = \gamma for any i and j, the infinite sequence \lim_{L \to \infty} R_L(h_1, h'_1) converges if

K(v, v') \, K(e, e') < 1 / (1 - \gamma)^2.   (14)

(Proof) Due to the assumption, q(i, j) = q(i_0, j_0) = \gamma^2. Thus it is sufficient to prove \sum_{i,j} t(i, j, i_0, j_0) < 1, that is,

\sum_{i,j} p_t(i|i_0) \, p'_t(j|j_0) \, K(v_i, v'_j) \, K(e_{i_0 i}, e'_{j_0 j}) < 1.   (15)

Due to Relation (2), we observe that

\sum_{i,j} p_t(i|i_0) \, p'_t(j|j_0) = \sum_i p_t(i|i_0) \sum_j p'_t(j|j_0) = (1 - \gamma)^2.

Thus (15) holds if all the coefficients satisfy K(v_i, v'_j) K(e_{i_0 i}, e'_{j_0 j}) < 1/(1 - \gamma)^2.

In particular, the above lemma holds for any \gamma > 0 if K(\cdot, \cdot) \le 1. Standard label kernels such as (3) and (4) satisfy this condition.
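As a concrete check (our own worked example, not from the paper): with the delta kernel (3) on both label types and a constant termination probability \gamma = 0.1,

K(v, v') \, K(e, e') \le 1 < \frac{1}{(1 - 0.1)^2} \approx 1.235,

so condition (14) is satisfied and the iteration of (8) is guaranteed to converge.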
4. Experiments

We applied our kernel to the prediction of properties of chemical compounds. A chemical compound can naturally be represented as an undirected graph by considering the atom types as the vertex labels, e.g., C, Cl and H, and the bond types as the edge labels, e.g., s (single bond) and d (double bond). For our graph kernel, we replaced each undirected edge by two directed edges (Figure 2), since the kernel assumes directed graphs.

Figure 2. A chemical compound is conventionally represented as an undirected graph (left). Atom types and bond types correspond to vertex labels and edge labels, respectively; the edge labels 's' and 'd' denote single and double bonds. As our kernel assumes a directed graph, undirected edges are replaced by directed edges (right).

4.1. Pattern Discovery Algorithm

We compare our graph kernel with the pattern-discovery (PD) method by Kramer and De Raedt (2001), which is one of the best state-of-the-art methods in predictive toxicology. As in our graph kernel, each feature is constructed as the count of a particular label path appearing in the graph.⁴ There are other methods which count more complicated substructures such as subgraphs (Inokuchi et al., 2000), but we focus on Kramer and De Raedt (2001), whose features are similar to ours.

Assume that we have n graphs G_1, ..., G_n. Also let us define #(h, G) as the number of appearances of a label path h in G. The PD method identifies the set H of all label paths which appear in at least m graphs:

H = \{ h \mid \sum_{i=1}^{n} \delta(#(h, G_i) > 0) \ge m \},

where the parameter m is called the minimum support parameter. Furthermore, it is possible to add extra conditions, e.g., selecting only the paths frequent in a certain class and scarce in the other classes. Then the feature vector of G based on the identified label paths is built as

G \to (#(h_1, G), ..., #(h_{|H|}, G)),   (16)

whose dimensionality is the cardinality of H. The PD method is useful for extracting comprehensible features. However, as the minimum support parameter gets smaller, the dimensionality of the feature vectors becomes so huge that a prohibitive amount of computation is required. Therefore, the user has to control the minimum support parameter m such that the feature space does not lose necessary information and, at the same time, the computation stays feasible.

The PD method contrasts markedly with our method. Our kernel method puts emphasis on dealing with infinite, but less interpretable, features, while the PD method tries to extract a relatively small number of meaningful features. Looking at the algorithms, our method is so simple that it is described by just one equation (9), while the PD method's algorithm is rather complicated (De Raedt & Kramer, 2001).

⁴Notice that the definition of label paths is different from ours in several points, e.g., a vertex will not be visited twice in a path. See Kramer and De Raedt (2001) for details.
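Given per-graph counts of label paths (their enumeration, which never revisits a vertex, is left to Kramer and De Raedt (2001)), the minimum-support selection and the feature map (16) can be sketched as follows; the data layout and names are our own.

```python
# A sketch of the pattern-discovery feature map (16): keep the label paths
# that occur in at least m graphs and build the count vector for each graph.
def pd_features(path_counts, m):
    """path_counts: one dict per graph, mapping label path -> count in that graph."""
    support = {}
    for counts in path_counts:
        for path in counts:
            support[path] = support.get(path, 0) + 1
    H = sorted(p for p, sup in support.items() if sup >= m)   # frequent paths
    return H, [[counts.get(p, 0) for p in H] for counts in path_counts]

# Toy usage: three "graphs" described only by their label-path counts.
counts = [{("A", "e", "A"): 2, ("A", "d", "D"): 1},
          {("A", "e", "A"): 1},
          {("B", "c", "D"): 3}]
H, X = pd_features(counts, m=2)
print(H, X)   # [('A', 'e', 'A')] and the count vectors [[2], [1], [0]]
```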
4.2. Datasets

We used two datasets, the PTC dataset (Helma et al., 2001) and the Mutag dataset (Srinivasan et al., 1996). The PTC dataset contains the results of the following pharmaceutical experiments. Each of 417 compounds is given to four types of test animals: Male Mouse (MM), Female Mouse (FM), Male Rat (MR) and Female Rat (FR). According to its carcinogenicity, each compound is assigned one of the following labels: {EE, IS, E, CE, SE, P, NE, N}, where CE, SE and P indicate "relatively active", NE and N indicate "relatively inactive", and EE, IS and E indicate "cannot be decided". In order to simplify the problem, we relabeled CE, SE and P as "positive", and NE and N as "negative". The task is to predict whether a given compound is positive or negative for each type of test animal. Thus we eventually had four two-class problems.

In the Mutag dataset, the task is defined as a two-class classification problem to predict whether each of the 188 compounds has mutagenicity or not.

Statistics of the datasets are summarized in Table 1.

Table 1. Several statistics of the datasets: the numbers of positive examples (#positive) and negative examples (#negative), the maximum size of the graphs (max. |G|), the average size of the graphs (avg. |G|), the maximum degree (max. degree), and the numbers of vertex labels (|Σ_V|) and edge labels (|Σ_E|).

              MM     FM     MR     FR     Mutag
#positive     129    143    152    121    125
#negative     207    206    192    230    63
max. |G|      109    109    109    109    40
avg. |G|      25.0   25.2   25.6   26.1   31.4
max. degree   4      4      4      4      4
|Σ_V|         21     19     19     20     8
|Σ_E|         4      4      4      4      4

4.3. Experimental Settings and Results

Assuming no prior knowledge, we defined the probability distributions for the random walks as follows. The initial probabilities were simply uniform, i.e., p_s(h) = 1/|G| for all h. The termination probabilities were set to a constant γ over all vertices. The transition probabilities p_t(h|h_0) were set as uniform over the adjacent vertices. We used Equation (3) as the label kernels. In solving the simultaneous equations, we employed the simple iterative method (8); in our observation, 20-30 iterations were enough for convergence in all cases. For the classification algorithm, we used the voted kernel perceptron (Freund & Schapire, 1999), whose performance is known to be comparable to that of SVMs. In the pattern discovery method, the minimum support parameter was set to 0.5, 1, 3, 5, 10 or 20% of the number of compounds, and the simple dot product in the feature space (16) was used as the kernel. In our graph kernel, the termination probability γ was changed from 0.1 to 0.9.
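For orientation, a plain (unvoted) kernel perceptron driven by a precomputed kernel matrix can be sketched as below. The experiments above use the voted variant of Freund and Schapire (1999), which additionally stores and votes over the intermediate hypotheses, so this is our own simplification rather than the exact classifier used.

```python
import numpy as np

# A simplified kernel perceptron on a precomputed kernel matrix.
def kernel_perceptron(K, y, epochs=10):
    """K: (n, n) kernel matrix over training graphs; y: labels in {-1, +1}."""
    y = np.asarray(y, dtype=float)
    alpha = np.zeros(len(y))
    for _ in range(epochs):
        for i in range(len(y)):
            score = np.dot(alpha * y, K[:, i])
            if y[i] * score <= 0:            # mistake-driven update
                alpha[i] += 1.0
    return alpha

def predict(K_test, alpha, y):
    """K_test[i, j] = kernel(train graph i, test graph j)."""
    return np.sign((alpha * np.asarray(y, dtype=float)) @ K_test)
```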

Table 2 and Table 3 show the classification accuracies in the five two-class problems, measured by leave-one-out cross validation.

Table 2. Classification accuracies (%) of the pattern discovery method. 'MinSup' shows the ratio of the minimum support parameter to the number of compounds, m/n.

MinSup   MM     FM     MR     FR     Mutag
0.5%     60.1   57.6   61.3   66.7   88.3
1%       61.0   61.0   62.8   63.2   87.8
3%       58.3   55.9   60.2   63.2   89.9
5%       60.7   55.6   57.3   63.0   86.2
10%      58.9   58.7   57.8   60.1   84.6
20%      61.0   55.3   56.1   61.3   83.5

Table 3. Classification accuracies (%) of our graph kernel. The parameter γ is the termination probability of the random walks, which controls the effect of the length of the label paths.

γ      MM     FM     MR     FR     Mutag
0.1    62.2   59.3   57.0   62.1   84.3
0.2    62.2   61.0   57.0   62.4   83.5
0.3    64.0   61.3   56.7   62.1   85.1
0.4    64.3   61.9   56.1   63.0   85.1
0.5    64.0   61.3   56.1   64.4   83.5
0.6    62.8   61.9   54.4   65.8   83.0
0.7    63.1   62.5   54.1   63.2   81.9
0.8    63.4   63.4   54.9   64.1   79.8
0.9    62.8   61.6   58.4   66.1   78.7

No general tendencies were found to conclude which method is better (the PD method was better in MR, FR and Mutag, but our method was better in MM and FM). Thus it would be fair to say that the performances were comparable in this small set of experiments. Even though we could not show that our method is constantly better, this result is still appealing, because the advantage of our method lies in its simplicity, both in concepts and in computational procedures.

5. Conclusion

In this paper, we introduced a new kernel between graphs with vertex labels and edge labels in the framework of marginalized kernels. We defined the kernel by using random walks on graphs, and reduced the computation of the kernel to solving a system of linear simultaneous equations. In contrast with the pattern-discovery method, our kernel takes into account all possible label paths without computing feature values explicitly. The structure we dealt with in this paper is fairly general, as it includes sequences, trees, DAGs and so on. Thus, applications of this kernel are by no means limited to chemical compounds. Promising targets would be DNA and RNA sequences with remote correlations, HTML and XML documents, distance graphs of 3-D protein structures, and so on. Recently, we found that our kernel can also be interpreted as an instance of the rational kernels (Cortes et al., 2003) over the probability semiring. In this interpretation, the probability distributions over a graph are considered as a weighted automaton. In future work, we would like to explore how to deal with infinite dimensions when another semiring is adopted.

Appendix: Relation to Exponential and Geometric Kernels for Labeled Graphs

Here we review the exponential and geometric kernels between graphs proposed by Gärtner (2002) and describe their relation to our kernel. Although his kernels are defined in a quite different formulation from ours, we relate them by using the same notations. His kernels are defined as the inner product of two |Σ_V| × |Σ_V| feature matrices M(G) and M(G'),

K(G, G') = \sum_{a_1=1}^{|\Sigma_V|} \sum_{a_2=1}^{|\Sigma_V|} M_{a_1,a_2}(G) \, M_{a_1,a_2}(G'),

where the feature matrix is defined as

M_{a_1,a_2}(G) = \sum_{\ell=1}^{\infty} \sum_{h_1, h_\ell} p(h_1, h_\ell) \, \delta(v_{h_1} = \sigma_{a_1}) \, \delta(v_{h_\ell} = \sigma_{a_2}),   (17)

and \Sigma_V = \{\sigma_1, \sigma_2, ..., \sigma_{|\Sigma_V|}\}. Notice that the probability p(h_1, h_\ell) is denoted as

p(h_1, h_\ell) = p_s(h_1) \, p_q(h_\ell) \sum_{h_2} \cdots \sum_{h_{\ell-1}} p_t(h_2|h_1) \cdots p_t(h_\ell|h_{\ell-1}).

Let T be a |G| × |G| matrix where T_{ij} = p_t(j|i). Also, let S and Q be |G| × |Σ_V| matrices where S_{ij} = p_s(i) \delta(v_i = \sigma_j) and Q_{ij} = p_q(i) \delta(v_i = \sigma_j), respectively. By means of these matrices, the feature matrix is rewritten as

M(G) = \sum_{\ell=1}^{\infty} S^{\top} T^{\ell-1} Q.

In his setting, the parameters p_s, p_t and p_q do not have stochastic constraints, i.e., p_s(i) = 1 and p_q(i) = 1 for any vertex i. Therefore, S = Q := L, where L_{ij} = \delta(v_i = \sigma_j). The feature matrix of the geometric kernel, M(G) = \sum_{\ell=1}^{\infty} L^{\top} (\beta E)^{\ell-1} L, is obtained by setting T = \beta E, where E is the adjacency matrix of G and \beta is a constant to assure convergence. In the case of the exponential kernel, M is defined as M(G) = \sum_{\ell=1}^{\infty} (L^{\top} (\beta E)^{\ell-1} L)/(\ell-1)!, which still results in a similar form.

As seen in (17), the primary difference from our approach is that they only take the vertex labels at both ends (h_1, h_\ell) into account. Although this simplification allows them to keep the size of the matrices small, it might be harmful when contiguous labels are essential, as in chemical compounds. Another difference is that they do not use an infinite dimensional feature space. Although they also compute the infinite sum in Equation (17), the infinite sum is for obtaining |Σ_V|² dimensional feature vectors.
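For the geometric kernel, the feature matrix can be computed without truncating the series: assuming the spectral radius of βE is below 1, \sum_{\ell \ge 1} (\beta E)^{\ell-1} = (I - \beta E)^{-1}. The following sketch uses this closed form; the rewriting and all names are our own, not code or notation from Gärtner (2002).

```python
import numpy as np

# Geometric-kernel feature matrix M(G) = sum_{l>=1} L^T (beta E)^(l-1) L,
# summed in closed form as L^T (I - beta E)^(-1) L (valid when the spectral
# radius of beta*E is below 1).
def geometric_feature_matrix(E, labels, alphabet, beta):
    n = E.shape[0]
    L = np.array([[1.0 if labels[i] == a else 0.0 for a in alphabet]
                  for i in range(n)])              # L[i, a] = delta(v_i = sigma_a)
    return L.T @ np.linalg.solve(np.eye(n) - beta * E, L)

E = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=float)   # directed 3-cycle
print(geometric_feature_matrix(E, ["A", "B", "A"], ["A", "B"], beta=0.3))
```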
References

Abiteboul, S., Buneman, P., & Suciu, D. (2000). Data on the Web: From relations to semistructured data and XML. Morgan Kaufmann.

Bahlmann, C., Haasdonk, B., & Burkhardt, H. (2002). On-line handwriting recognition with support vector machines – a kernel approach. Proc. of the 8th Int. Workshop on Frontiers in Handwriting Recognition (IWFHR) (pp. 49–54).

Barrett, R., Berry, M., Chan, T. F., Demmel, J., Donato, J., Dongarra, J., Eijkhout, V., Pozo, R., Romine, C., & Van der Vorst, H. (1994). Templates for the solution of linear systems: Building blocks for iterative methods, 2nd edition. Philadelphia, PA: SIAM.

Collins, M., & Duffy, N. (2001). Convolution kernels for natural language. Proc. of the 14th NIPS.

Cortes, C., Haffner, P., & Mohri, M. (2003). Rational kernels. Advances in Neural Information Processing Systems 15. MIT Press. In press.

De Raedt, L., & Kramer, S. (2001). The levelwise version space algorithm and its application to molecular fragment finding. Proceedings of the 17th International Joint Conference on Artificial Intelligence (pp. 853–862). Morgan Kaufmann.

Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press.

Freund, Y., & Schapire, R. (1999). Large margin classification using the perceptron algorithm. Machine Learning, 37, 277–296.

Gärtner, T. (2002). Exponential and geometric kernels for graphs. NIPS*02 Workshop on Unreal Data: Principles of Modeling Nonvectorial Data. Available from http://mlg.anu.edu.au/unrealdata/.

Haussler, D. (1999). Convolution kernels on discrete structures (Technical Report UCSC-CRL-99-10). University of California, Santa Cruz.

Helma, C., King, R., Kramer, S., & Srinivasan, A. (2001). The Predictive Toxicology Challenge 2000–2001. Bioinformatics, 17, 107–108.

Inokuchi, A., Washio, T., & Motoda, H. (2000). An Apriori-based algorithm for mining frequent substructures from graph data. Proc. of the 4th PKDD (pp. 13–23).

Jaakkola, T., Diekhans, M., & Haussler, D. (2000). A discriminative framework for detecting remote protein homologies. Journal of Computational Biology, 7, 95–114.

Kandola, J., Shawe-Taylor, J., & Cristianini, N. (2003). Learning semantic similarity. Advances in Neural Information Processing Systems 15. MIT Press. In press.

Kashima, H., & Inokuchi, A. (2002). Kernels for graph classification. IEEE ICDM Workshop on Active Mining. Maebashi, Japan.

Kashima, H., & Koyanagi, T. (2002). Kernels for semi-structured data. Proc. of the 19th ICML (pp. 291–298).

Kondor, R. I., & Lafferty, J. (2002). Diffusion kernels on graphs and other discrete input spaces. Proc. of the 19th ICML (pp. 315–322).

Kramer, S., & De Raedt, L. (2001). Feature construction with version spaces for biochemical application. Proc. of the 18th ICML (pp. 258–265).

Lafferty, J., & Lebanon, G. (2003). Information diffusion kernels. Advances in Neural Information Processing Systems 15. MIT Press. In press.

Leslie, C., Eskin, E., Weston, J., & Noble, W. (2003). Mismatch string kernels for SVM protein classification. Advances in Neural Information Processing Systems 15. MIT Press. In press.

Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., & Watkins, C. (2002). Text classification using string kernels. Journal of Machine Learning Research, 2, 419–444.

Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, Massachusetts: The MIT Press.

Rugh, W. (1995). Linear system theory. Prentice Hall.

Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.

Shimodaira, H., Noma, K.-I., Nakai, M., & Sagayama, S. (2002). Dynamic time-alignment kernel in support vector machine. Advances in Neural Information Processing Systems 14 (pp. 921–928). Cambridge, MA: MIT Press.

Srinivasan, A., Muggleton, S., King, R. D., & Sternberg, M. (1996). Theories for mutagenicity: A study of first-order and feature based induction. Artificial Intelligence, 85, 277–299.

Takimoto, E., & Warmuth, M. K. (2002). Path kernels and multiplicative updates. Proc. of the 15th Annual Conference on Computational Learning Theory (pp. 74–89).

Tsuda, K., Kin, T., & Asai, K. (2002). Marginalized kernels for biological sequences. Bioinformatics, 18, S268–S275.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Vishwanathan, S., & Smola, A. (2003). Fast kernels for string and tree matching. Advances in Neural Information Processing Systems 15. MIT Press. In press.

Watkins, C. (2000). Dynamic alignment kernels. In A. Smola, P. Bartlett, B. Schölkopf and D. Schuurmans (Eds.), Advances in large margin classifiers, 39–50. MIT Press.