Structural Alignment based Kernels for Protein Structure Classification

Sourangshu Bhattacharya [email protected]
Chiranjib Bhattacharyya [email protected]
Dept. of Computer Science and Automation, Indian Institute of Science, Bangalore - 560 012, India.
Nagasuma Chandra [email protected]
Bioinformatics Center, Indian Institute of Science, Bangalore - 560 012, India.

Abstract

Structural alignments are the most widely used tools for comparing proteins with low sequence similarity. The main contribution of this paper is to derive various kernels on proteins from structural alignments, which do not use sequence information. Central to the kernels is a novel alignment algorithm which matches substructures of fixed size using spectral graph matching techniques. We derive positive semi-definite kernels which capture the notion of similarity between sub-structures. Using these as a base, more sophisticated kernels on protein structures are proposed. To empirically evaluate the kernels we used 40% sequence non-redundant structures from 15 different SCOP superfamilies. The kernels, when used with SVMs, show competitive performance with CE, a state of the art structure comparison program.

Appearing in Proceedings of the 24th International Conference on Machine Learning, Corvallis, OR, 2007. Copyright 2007 by the author(s)/owner(s).

1. Introduction

Classification of proteins into different classes of interest is a problem of fundamental interest in computational biology. Powerful techniques exist for classifying proteins having high sequence similarity. However, these methods are not reliable when sequence similarity falls in the twilight zone (Bourne & Shindyalov, 2003), i.e. is in the range 20-30%. Developing classification tools for such proteins is an important research issue. Since sequence information is unreliable, it is interesting to think about classification tools based on structures alone, which deliberately ignore the amino acid types. This view is taken by most structure alignment algorithms (Eidhammer et al., 2000), which are the most widely accepted methods for protein structure comparison.

In this paper, we explore the idea of designing classifiers based on Support Vector Machines (SVMs) for proteins having low sequence similarity. More explicitly, our goal is to derive positive semi-definite kernels motivated from structural alignments without using sequence information. The hypothesis is that any structure classification technique should capture the notion of similarity defined by structural alignment algorithms. Though there is a lot of work on designing kernels on protein sequences (Jaakkola et al., 1999; J.-P. Vert, 2004; Leslie & Kwang, 2004), to the best of our knowledge there is relatively less work on kernels on protein structures. (Wang & Scott, 2005) is an interesting first step, where both sequence and structure information are used to define kernels on protein structures.

First, we propose a novel protein structure alignment algorithm using a spectral projection based similarity measure. We show that naive spectral graph matching (Umeyama, 1988) based techniques fail to produce the correct alignment in many cases due to the existence of unmatched residues, called indels, in many similar structures. This drawback is remedied by considering pairs of substructures from each protein.

The main contribution of this paper is to propose kernels on protein structures based on structural alignments. Following the idea of sub-structure matching, we propose novel kernels on protein sub-structures using the spectral projection vector and pairwise distances, and show that a limiting case of the spectral kernel relates to the alignment score on sub-structures. Combining these kernels on sub-structures, we define various kernels on protein structures. Using substructure kernels, we also define kernels on protein structures that take any pairwise structural alignment explicitly into account. Benchmarking experiments are conducted on a sequence non-redundant SCOP dataset. We benchmark our results against CE (Bourne & Shindyalov, 1998), a state of the art structure comparison program, on a dataset with low sequence similarity. Extensive experiments show that the results are promising.

The paper is organized as follows. Section 2 describes the development of the structure alignment algorithm. Section 3 develops the kernels over sub-structures and subsequently over protein structures. Section 4 describes the experimental results.

2. Protein Structure Alignment

2.1. Background

The tertiary structure of a protein is represented by the 3-dimensional coordinates of all atoms present in a given protein chain, with respect to an arbitrary coordinate system. However, following common practice, we consider only the Cα atom of each residue. Thus, we represent a protein structure as a set of points in 3 dimensions corresponding to its residues: a protein P is represented as P = {p_1, ..., p_n} where p_i ∈ R^3, 1 ≤ i ≤ n.

A structural alignment between two proteins P^A and P^B is a 1-1 mapping φ : {i | p_i^A ∈ P̄^A} → {j | p_j^B ∈ P̄^B}, where P̄^A ⊆ P^A and P̄^B ⊆ P^B. The mapping φ defines a set of correspondences between the residues of the two proteins. |P̄^A| = |P̄^B| is called the length of the alignment. Given a structural alignment φ between two structures P^A and P^B, and a transformation T of structure B onto A, a popular measure of the goodness of superposition is the root mean square deviation (RMSD), defined as

\[
\mathrm{RMSD}(\phi) = \sqrt{\frac{1}{|\bar{P}^A|} \sum_{p_i^A \in \bar{P}^A} \big\| p_i^A - T(p_{\phi(i)}^B) \big\|^2 }.
\]

Given an alignment φ, the optimal transformation T minimizing the RMSD can be computed in closed form using the method described in (Horn, 1987).

However, using RMSD as a measure for evaluating alignments has two problems: the optimal transformation T needs to be computed for every alignment, and RMSD favors alignments of lower lengths, which may not capture the total similarity between two proteins in the presence of noise. An alternate measure, called the distance RMSD, avoids the calculation of the optimal transformation. The distance RMSD of an alignment φ is defined as

\[
\mathrm{RMSD}_D(\phi) = \sqrt{\frac{1}{|\bar{P}^A|^2} \sum_{p_i^A, p_j^A \in \bar{P}^A} \big( d_{ij}^A - d_{\phi(i)\phi(j)}^B \big)^2 }
\]

where d_{ij}^A is the distance between residues p_i^A and p_j^A. The matrix d is also called the distance matrix of a protein structure. The drawback of preferring smaller lengths was remedied in the formulation proposed in DALI (Holm & Sander, 1993). The score function used in DALI is

\[
S_{\mathrm{DALI}}(\phi) = \sum_{p_i^A, p_j^A \in \bar{P}^A} \left( 0.2 - \frac{\big| d_{ij}^A - d_{\phi(i)\phi(j)}^B \big|}{\bar{d}_{ij}} \right) \exp\left( -\Big( \frac{\bar{d}_{ij}}{20} \Big)^2 \right)
\]

where d̄_{ij} = (d_{ij}^A + d_{φ(i)φ(j)}^B)/2. DALI stochastically searches over all alignments for the φ which maximizes S_DALI. Note that residues which are spatially far apart do not contribute much to the DALI score. This is due to the exponentially decreasing function of the average distance used to weight each term. Following this observation, we define the adjacency matrix of a protein as A_ij = e^{-d_ij/α}, α > 0. This is an exponentially decreasing function between 0 and 1, making the entries corresponding to far-away residues very close to zero. The problem of finding optimal correspondences can then be viewed as a weighted graph matching problem (Umeyama, 1988) with weights given by the adjacency matrix values. In the next section, we propose an alternate formulation using spectral graph theoretic techniques.

2.2. Spectral Solution to Protein Structure Alignment

Consider the ideal situation where proteins P^A and P^B have an equal number of residues, all of which have a corresponding residue in the other structure. In such a case, the problem of finding the optimal alignment between the two proteins is the same as finding the optimal permutation of the residues of one of the proteins. Let A^A and A^B be the adjacency matrices corresponding to the two proteins. Consider their eigenvalue decompositions A^A = Σ_{i=1}^n λ_i^A ζ_i^A (ζ_i^A)^T and A^B = Σ_{i=1}^n λ_i^B ζ_i^B (ζ_i^B)^T, with the terms on the RHS sorted according to decreasing value of λ. It is easy to see that the eigenvectors ζ_i^A and ζ_i^B will be related by the same permutation as the residues of the original proteins. Thus, the correct alignment can be retrieved by calculating the permutation that best matches the entries of ζ_i^A and ζ_i^B.

The adjacency matrix A of any protein can be approximated by k eigenvectors with an error η, given by

\[
\Big\| A - \sum_{i=1}^k \lambda_i \zeta_i \zeta_i^T \Big\|_F^2 = \sum_{j=k+1}^n \lambda_j^2 = \eta^2 \|A\|_F^2 .
\]

For k = 1, η is minimized when the eigenvector corresponding to the largest eigenvalue, ζ_1, is considered. The j-th component of the leading eigenvector, ζ_1(j), can be thought of as a projection of the corresponding residue onto the real line. It was also observed that spatially close residues have similar projected values; i.e. if A_jl is high, then |ζ_1(j) - ζ_1(l)| is low. Hence, ζ_1 can be thought of as a projection that preserves neighborhood (Bhattacharya et al., 2006).

Let ζ_1^A and ζ_1^B be the eigenvectors corresponding to the largest eigenvalues of the adjacency matrices of proteins P^A and P^B. Define the spectral projection vector f to be the vector of absolute values of the components of ζ_1. Thus, f_i^A = |ζ_1^A(i)| and f_i^B = |ζ_1^B(i)|. The absolute values are taken because if ζ is an eigenvector, so is -ζ.
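As an illustration, the spectral projection vector can be computed in a few lines. The sketch below is in Python/NumPy rather than the authors' C/Lapack implementation, and the value α = 2.0 is an arbitrary choice for the example:

```python
import numpy as np

def spectral_projection(coords, alpha=2.0):
    """Spectral projection vector f of a structure.

    coords: (n, 3) array of C-alpha coordinates.
    Builds the adjacency matrix A_ij = exp(-d_ij / alpha) and returns the
    absolute value of the eigenvector of the largest eigenvalue of A.
    """
    coords = np.asarray(coords, dtype=float)
    # pairwise Euclidean distance matrix d_ij
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    A = np.exp(-d / alpha)
    w, v = np.linalg.eigh(A)      # eigenvalues in ascending order
    return np.abs(v[:, -1])       # |zeta_1|, the leading eigenvector
```

In the ideal indel-free case, permuting the residues permutes f in the same way (up to the sign ambiguity removed by the absolute value), which is exactly the property the matching step exploits.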

We define the similarity between residue i of P^A and residue j of P^B as:

\[
s(i, j) = T - (f_i^A - f_j^B)^2 \tag{1}
\]

Here, T is a threshold on the minimum similarity between f_i^A and f_j^B for the pair to be included in an alignment. Setting T very high will match the entire structures of the two proteins. Using this similarity function, we are interested in finding an alignment φ : {i | p_i^A ∈ P̄^A} → {j | p_j^B ∈ P̄^B} that solves:

\[
S(\phi^*) = \max_{\phi} \sum_{i : p_i^A \in \bar{P}^A} s(i, \phi(i)) \tag{2}
\]

This is an instance of the assignment problem, or weighted bipartite graph matching problem, which can be solved using linear programming (Bertsimas & Tsitsiklis, 1997).

Unfortunately, in many cases this technique fails to detect the best alignment (see results). This is due to the fact that many protein structures have a significant number of residues which do not have a corresponding residue in the other, similar structure. These extra residues, called indels, add extra rows and columns to the adjacency matrix, thereby changing the eigenvalues and eigenvectors.
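The optimal alignment of equation (2) is a linear assignment. A minimal sketch, assuming SciPy is available and using its Hungarian-method solver in place of the linear-programming route cited in the text; `T = 1.0` is an arbitrary illustrative threshold:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_by_assignment(fA, fB, T=1.0):
    """Maximize sum_i s(i, phi(i)) with s(i, j) = T - (fA_i - fB_j)^2.

    fA, fB: spectral projection vectors of the two proteins.
    Returns the matched pairs (i, phi(i)) with positive similarity.
    """
    s = T - (fA[:, None] - fB[None, :]) ** 2   # similarity matrix s(i, j)
    rows, cols = linear_sum_assignment(-s)     # solver minimizes, so negate
    # keep only pairs whose similarity is positive, i.e. clears threshold T
    return [(i, j) for i, j in zip(rows, cols) if s[i, j] > 0]
```

For identical projection vectors, the solver recovers the identity correspondence, since the diagonal of s attains the maximum T in every row.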
This shortcoming of the spectral method can be overcome in the case of protein structures by considering sub-structures instead of whole structures for alignment. The reason is that even though there may be many indels between two similar structures, the portions responsible for the function of the protein, or for maintaining the protein fold, remain highly conserved. Thus, considering sub-structures of equal size (in number of residues) around each residue is likely to give many entirely matched pairs of sub-structures for similar proteins, and vice versa. A sub-structure N_i^A of protein P^A centered at residue i is the set of l residues that are closest to residue i in 3-D space, l being the size of the sub-structures being considered. Thus, N_i^A = {p_j^A ∈ P^A | p_k^A ∉ N_i^A ⇒ ||p_j^A - p_i^A|| ≤ ||p_k^A - p_i^A||}, with |N_i^A| = l. The robust algorithm for comparing protein structures P^A and P^B is:

1. Compute the sub-structures centered at each residue of both proteins P^A and P^B.

2. For each pair of sub-structures, one from each protein, compute the alignment between the sub-structures by solving the problem described in equations (1) and (2).

3. For each sub-structure alignment computed above, compute the optimal transformation of the sub-structure from P^A onto the one from P^B. Transform the whole of P^A onto P^B using the computed transformation, and compute the similarity scores between residues of P^A and P^B.

4. Compute the optimal alignment between P^A and P^B by solving the assignment problem described in equation (2) using the similarity scores computed above.

5. Report the best alignment among all the alignments computed in the above step as the optimal one.

The sub-structure based algorithm relies on the assumption that the optimal structural alignment between two protein structures contains at least one pair of optimally and fully aligned sub-structures, one from each of the proteins. For each sub-structure alignment computed in step 2 of the above algorithm, the optimal transformation superposing the corresponding residues is calculated using the method described in (Horn, 1987). The "best" alignment mentioned in step 5 is decided on the basis of both the RMSD and the length of the alignment. A detailed description and benchmarking of the method will be presented elsewhere. Taking ideas from the algorithm designed in this section, we propose kernels for protein structures in the next section.

3. Kernels for Protein Structure Classification

3.1. Kernels for Sub-structures

The algorithm designed in the previous section motivates the notion of first proposing kernels on sub-structures and then suitably combining them to derive kernels on entire protein structures. To this end, we propose kernels over sub-structures in this section. (Wang & Scott, 2005) also combine substructure kernels to build kernels on protein structures. However, their kernels use sequence similarity inside substructures. Also, since a kernel function captures the similarity between two objects, the algorithm gives us the idea of using the spectral projection vector of each sub-structure to define the substructure kernels.
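The sub-structure extraction of step 1 can be sketched directly from the definition of N_i (Python/NumPy; the paper fixes l = 6 in the experiments):

```python
import numpy as np

def substructure(coords, i, l=6):
    """Indices of the sub-structure N_i: the l residues closest to residue i
    in 3-D space. Residue i itself is included, since its distance is zero."""
    d = np.linalg.norm(coords - coords[i], axis=1)
    return np.argsort(d)[:l]
```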

Let N_1 and N_2 be two sub-structures, each having l residues, and let f^1 and f^2 be their spectral projection vectors. We define the kernel K_res between the i-th residue of N_1 and the j-th residue of N_2 as a decreasing function of the difference in spectral projection: K_res(i, j) = e^{-(f_i^1 - f_j^2)^2 / β}. Technically, since a kernel should be defined over a set of objects, we need a principled way to lift these residue-level kernels to a kernel on whole sub-structures; the convolution kernel framework described next provides such a tool.
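The residue-level kernel K_res is a one-liner in matrix form; a small Python/NumPy sketch, with β chosen arbitrarily for illustration:

```python
import numpy as np

def k_res(f1, f2, beta=0.1):
    """Matrix of residue kernels K_res(i, j) = exp(-(f1_i - f2_j)^2 / beta),
    given the spectral projection vectors f1, f2 of two sub-structures."""
    return np.exp(-((f1[:, None] - f2[None, :]) ** 2) / beta)
```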

Convolution kernels were proposed in (Haussler, 1999) as a general tool for building kernels on complex objects formed by combining simpler objects on which kernels are already known. Let x ∈ X be a composite object formed using parts from X_1, ..., X_m. Let R be a relation over X_1 × ... × X_m × X such that R(x_1, ..., x_m, x) is true if x is composed of x_1, ..., x_m. Let R^{-1}(x) = {(x_1, ..., x_m) ∈ X_1 × ... × X_m | R(x_1, ..., x_m, x) = true}, and let K^1, ..., K^m be kernels on X_1, ..., X_m, respectively. The convolution kernel K over X is defined as:

\[
K(x, y) = \sum_{(x_1,\ldots,x_m) \in R^{-1}(x),\ (y_1,\ldots,y_m) \in R^{-1}(y)} \ \prod_{i=1}^{m} K^i(x_i, y_i) \tag{3}
\]

As shown in (Haussler, 1999), if K^1, ..., K^m are symmetric and positive semidefinite, so is K.

Sub-structure Spectral Kernel. In our case, X is the set of all sub-structures and X_1, ..., X_m are the sets of all residues p_i from all the sub-structures. Note that even if the same residue appears in different sub-structures, the appearances are considered to be different. Since each sub-structure has size l, in our case m = l. The relation R is defined by: R(p_1, ..., p_l, N) is true iff N = {p_1, ..., p_l}. Since all the X_i's contain all the residues of N, the cases for which R holds true are the permutations of the residues of N. Since this happens for both N_1 and N_2, each combination of correspondences occurs l! times. Thus, from the above equation, the kernel becomes K(N_1, N_2) = l! Σ_{π∈Π(l)} e^{-Σ_{k=1}^l (f_k^1 - f_{π(k)}^2)^2 / β} = l! Σ_{π∈Π(l)} e^{-||f^1 - π(f^2)||^2 / β}, where f^i is the spectral projection vector of N_i and Π(l) is the set of all permutations of l numbers. Since l! is a constant, we define the sub-structure spectral kernel as:

\[
K_{SS}(N_i, N_j) = \sum_{\pi \in \Pi(l)} e^{-\|f^i - \pi(f^j)\|^2 / \beta} \tag{4}
\]

Pairwise distance substructure kernel. Using the convolution kernel technique, we define another kernel on substructures based on the pairwise distances between residues. In this case, X is the set of all sub-structures and X_1, ..., X_m are the sets of pairwise distances d_ij, i < j, from all the sub-structures. This gives the kernel:

\[
K_{PDS}(N_i, N_j) = \sum_{\pi \in \Pi(l)} e^{-\|d^i - \pi(d^j)\|^2 / \sigma^2} \tag{5}
\]

where d^i denotes the vector of pairwise distances of N_i and π acts on it through the corresponding permutation of residues. Using the result in (Haussler, 1999), K_SS and K_PDS are positive semidefinite. Next, we prove a relation between the score obtained by solving the assignment problem in equation (2) and K_SS.

Theorem 3.1. Let N_i and N_j be two sub-structures with spectral projection vectors f^i and f^j. Let S(N_i, N_j) be the score of the alignment of N_i and N_j obtained by solving the assignment problem specified in equation (2), for a large enough value of T such that all residues are matched. Then

\[
e^{lT} \lim_{\beta \to 0} \big( K_{SS}(N_i, N_j) \big)^{\beta} = e^{S(N_i, N_j)} .
\]

Proof: Let π* be the permutation that gives the optimal score S(N_i, N_j). By definition, e^{S(N_i,N_j)} = max_{π∈Π(l)} e^{Σ_{k=1}^l (T - (f_k^i - f_{π(k)}^j)^2)} = e^{lT} e^{-||f^i - π*(f^j)||^2}. Also,

\[
\lim_{\beta \to 0} \big( K_{SS}(N_i, N_j) \big)^{\beta}
= \lim_{\beta \to 0} \Big( \sum_{\pi \in \Pi(l)} e^{-\|f^i - \pi(f^j)\|^2/\beta} \Big)^{\beta}
= e^{-\|f^i - \pi^*(f^j)\|^2} \lim_{\beta \to 0} \Big( 1 + \sum_{\pi \in \Pi(l) \setminus \{\pi^*\}} T_1 \Big)^{\beta}
= e^{-\|f^i - \pi^*(f^j)\|^2}
\]

where T_1 = e^{-(||f^i - π(f^j)||^2 - ||f^i - π*(f^j)||^2)/β} ≤ 1. Combining the two expressions proves the claim. □

A similar result holds for K_PDS. However, in that case optimizing the similarity measure is an NP-complete problem (Pardalos et al., 1994). A similar result was proved for the local alignment kernel defined on strings in (J.-P. Vert, 2004). We define the limiting case of K_SS as a new kernel, K_LSS(N_1, N_2) = lim_{β→0} (K_SS(N_1, N_2))^β.

Calculation of K_SS and K_PDS takes time exponential in the substructure size. However, this is not a major hindrance, as the substructure size is fixed to a small number (6 in our experiments). K_LSS can be computed in time polynomial in the substructure size, since the minimizing permutation between two sets of projections on the real line is found by sorting. Unfortunately, K_LSS is not necessarily positive semidefinite. However, in the next section we develop positive semidefinite kernels on protein structures using K_SS, K_PDS and K_LSS.
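To make equation (4) and the limiting kernel concrete, the Python sketch below (illustrative only; the paper's implementation is in C with Lapack, and β is arbitrary here) computes K_SS by brute force over all l! permutations, and K_LSS by exploiting the fact that for 1-D projections the minimizing permutation matches the sorted orders of the two vectors:

```python
import numpy as np
from itertools import permutations

def k_ss(f1, f2, beta):
    """K_SS(N1, N2) = sum over permutations pi of exp(-||f1 - pi(f2)||^2 / beta).
    Brute force over l! permutations; only practical for small l."""
    l = len(f1)
    return sum(np.exp(-np.sum((f1 - f2[list(p)]) ** 2) / beta)
               for p in permutations(range(l)))

def k_lss(f1, f2):
    """Limiting kernel K_LSS = exp(-min_pi ||f1 - pi(f2)||^2).
    For points on the real line, the minimizing permutation pairs the two
    vectors in sorted order, so sorting suffices."""
    a, b = np.sort(f1), np.sort(f2)
    return np.exp(-np.sum((a - b) ** 2))
```

Numerically, (K_SS)^β approaches K_LSS as β → 0, which is the content of Theorem 3.1 up to the e^{lT} factor.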
3.2. Kernels on Protein Structures

Viewing kernel values as a measure of similarity between proteins, it is natural to consider the sum of sub-structure kernels over all pairs of sub-structures between the two proteins as a kernel between the proteins. Thus, given proteins P^i = {p_1^i, ..., p_{n_i}^i} and P^j = {p_1^j, ..., p_{n_j}^j}, we define the following two kernels based on the two sub-structure kernels defined earlier:

\[
K_1(P^i, P^j) = \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} K_{SS}(N_a^i, N_b^j) \tag{6}
\]

\[
K_2(P^i, P^j) = \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} K_{PDS}(N_a^i, N_b^j) \tag{7}
\]

Here, N_a^i is the sub-structure in protein P^i centered at residue p_a^i. Both K_1 and K_2 are positive semidefinite because they are sums of positive semidefinite kernels.

The kernels defined above add up the substructure kernel values over all possible pairs of sub-structures. However, this is a bad estimate of the similarity between the proteins, as most of the sub-structure pairs, even for similar proteins, might not be similar. One possible way of increasing the accuracy is to consider two sub-structures from each protein and weight the substructure kernels with a decreasing function of the distance between them. To maintain positive semi-definiteness of the resulting function, it is necessary that the weighting function be positive semidefinite. We use the Gaussian kernel

\[
K_{norm}\big( (N_a^i, N_b^i), (N_c^j, N_d^j) \big) = e^{-\big( \|p_a^i - p_b^i\| - \|p_c^j - p_d^j\| \big)^2 / \sigma^2}
\]

as the weighting function. Note that in this case the kernel is over all pairs of sub-structures in each structure. Thus, we define two more kernels:

\[
K_3(P^i, P^j) = \sum_{a,b=1}^{n_i} \sum_{c,d=1}^{n_j} K_{SS}(N_a^i, N_b^i) \times K_{SS}(N_c^j, N_d^j) \times K_{norm}\big( (N_a^i, N_b^i), (N_c^j, N_d^j) \big) \tag{8}
\]

\[
K_4(P^i, P^j) = \sum_{a,b=1}^{n_i} \sum_{c,d=1}^{n_j} K_{PDS}(N_a^i, N_b^i) \times K_{PDS}(N_c^j, N_d^j) \times K_{norm}\big( (N_a^i, N_b^i), (N_c^j, N_d^j) \big) \tag{9}
\]

It is easy to see that K_3(P^i, P^j) and K_4(P^i, P^j) are both positive semidefinite. Even though these kernels are more accurate than the previous two, they are computationally expensive: the computation time of these kernels is O(n^4), which is also more than that of many structural alignment algorithms (Bourne & Shindyalov, 1998; Bhattacharya et al., 2006).

Another interesting way of increasing the accuracy of the kernels is to take the correspondences produced by a structural alignment algorithm explicitly into account while computing the kernel values. Thus, given an alignment φ_ij between proteins P^i and P^j, we define the alignment kernels as:

\[
K_1^{Al}(P^i, P^j; \phi_{ij}) = \sum_{a \,|\, p_a^i \in \bar{P}^i} K_{SS}\big( N_a^i, N_{\phi_{ij}(a)}^j \big) \tag{10}
\]

K_2^{Al} and K_3^{Al} are defined analogously by replacing K_SS with K_LSS and K_PDS in the above equation, respectively. Unfortunately, these kernels are not necessarily positive semidefinite. For such cases, a standard trick is to compute the eigenvector decomposition of the kernel matrix, force all the negative eigenvalues to zero, and recompute the kernel matrix from the modified eigenvector decomposition. In our experiments, this trick is used to make these kernels positive semidefinite. However, it requires computation of the eigenvalues of the kernel matrix. We propose another set of kernels that, while keeping the off-diagonal terms intact and only modifying the diagonal terms, is always positive semidefinite. Thus, given a dataset of proteins {P^1, ..., P^M} and all possible pairwise alignments φ_ij between proteins P^i and P^j, we define the kernel K_4^{Al} as:

\[
K_4^{Al}(P^i, P^j) =
\begin{cases}
\displaystyle \sum_{a \,|\, p_a^i \in \bar{P}^i} K_{SS}\big( N_a^i, N_{\phi_{ij}(a)}^j \big) & \text{if } i \neq j \\[2ex]
\displaystyle \sum_{b=1}^{M} \sum_{a \,|\, p_a^i \in \mathrm{dom}(\phi_{ib})} K_{SS}\big( N_a^i, N_{\phi_{ib}(a)}^b \big) & \text{if } i = j
\end{cases} \tag{11}
\]

where dom(φ_ib) is the domain of the function φ_ib, i.e. the set of all residues of structure P^i participating in the alignment φ_ib. K_5^{Al} and K_6^{Al} are defined analogously by replacing K_SS with K_LSS and K_PDS, respectively.

Theorem 3.2. K_4^{Al} is positive semidefinite if K_SS is positive valued.

Proof: Let L_max be the maximum length of alignment between all pairs of proteins in the dataset. Consider the M × (M(M+1)/2 · L_max) matrix H having one row for each protein in the dataset. Each row has M(M+1)/2 blocks of length L_max, one corresponding to each pairwise alignment (including the alignment of every structure with itself). For the rows corresponding to proteins i and j, the k-th element of the block for alignment φ_ij is equal to √(K_SS(N_k^i, N_{φ_ij(k)}^j)). The index k runs over all the correspondences in the alignment. For alignments that have length smaller than L_max, the remaining entries are set to zero. It can be seen that K_4^{Al} = HH^T. As each entry of K_4^{Al} is a dot product between two vectors, K_4^{Al} is positive semidefinite. □

Note that K_SS needs only to be positive valued, not positive semidefinite. So K_5^{Al} and K_6^{Al} are also positive semidefinite, even though K_LSS is not. The computational complexity of K_1 and K_2 is O(n_1 n_2) and that of K_3 and K_4 is O(n_1^2 n_2^2). The alignment kernels have complexity O(N_al), where N_al is the number of aligned residues. In the next section, we report results of experiments conducted to validate the structure alignment algorithm and the kernels developed in this section for protein structure classification.

4. Experimental Results

The algorithms and kernels developed above were implemented and tested on real protein structures from various structural classes obtained from the PDB (Berman et al., 2000). The alignment algorithms and kernels were implemented in C using GCC/GNU Linux. Eigenvalue computations were done using Lapack (available at http://www.netlib.org/clapack/). SVM classifications were done using Libsvm (available at http://www.csie.ntu.edu.tw/~cjlin/libsvm).

The alignment algorithms were validated against their popularly used counterparts, e.g. DALI (Holm & Sander, 1993) and CE (Bourne & Shindyalov, 1998). Due to lack of space, only indicative results are reported here. The kernels were evaluated on the task of classifying protein structures. SCOP (Murzin et al., 1995) was taken as the standard for protein structure classification. Nearest neighbor classification using the z-score given by CE (Bourne & Shindyalov, 1998) was used as the standard method.

4.1. Structural Alignment Results from the Spectral Projection based Algorithms

In order to test the protein structure alignment algorithms described in section 2.2, we used the algorithms to compare a number of similar protein structures retrieved from the PDB. Two other popularly used protein structure alignment programs, DALI (Holm & Sander, 1993) and CE (Bourne & Shindyalov, 1998), were also used to compare the same proteins. Table 1 shows the RMSDs and lengths of the alignments given by the programs for a set of proteins from a variety of classes. The results show that on similar proteins both spectral methods perform at par with the existing techniques. On the pair 2PEL - 5CNA the spectral algorithms give much better alignments than the existing methods. This is due to the fact that this pair of proteins is related by circular permutations. Also, in order to test the performance of the algorithms on proteins showing indels, we aligned the multi-domain protein 2HCK with its individual domains (table 1). It is clear that the algorithm calculating similarity from spectral projections of the entire protein structure does not report the correct alignments even when the sub-structures are highly similar (RMSD 0). Thus, the sub-structure based method is more robust to indels than the naive spectral method.

4.2. Structure Classification using Kernels

In order to test the effectiveness of the kernels proposed in this article, we consider the difficult task of classifying proteins with low sequence similarity, using SCOP (Murzin et al., 1995). SCOP provides a 40% sequence non-redundant dataset (the lowest cutoff provided by SCOP), having about 4000 proteins. The entire dataset was taken, and the superfamilies having more than 10 proteins were found (21 in total). Of these, 15 superfamilies were taken and 10 proteins from each superfamily were chosen randomly.

Experiments were performed on this dataset using a protocol similar to that in (Wang & Scott, 2005). 15 binary classification problems were posed for identifying each class, where the negative data contained 10 proteins (to keep the dataset balanced) randomly chosen from all the other classes. Leave one out cross validation using SVMs was performed on all 15 such classification problems, and positive and negative classification accuracies are reported in table 2. The experiments were repeated 10 times with different negative datasets, and the average accuracy is reported. Alignments required for the alignment kernels were computed using the method described in (Bhattacharya et al., 2006). Also, the alignment kernels (K_4^{Al} - K_6^{Al}) showed the problem of diagonal dominance, which was remedied using the method suggested in (Schölkopf et al., 2002). Results reported for CE (Bourne & Shindyalov, 1998) come from the same experiments, except that a nearest neighbor classifier was used, taking z-score values as the similarity measure.

Kernels K_1 and K_2 perform reasonably well in terms of positive classification accuracy, though on average the negative classification accuracy is lower. K_3 and K_4 show similar performance except on the last class (SCOP superfamily a.39.1). The performance of K_3 and K_4 did not achieve the expected improvement due to inadequate tuning of parameters.
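The eigenvalue-clipping repair used for the indefinite alignment kernels in section 3.2 can be sketched as follows (Python/NumPy; the example matrix is arbitrary):

```python
import numpy as np

def clip_to_psd(K):
    """Project a symmetric kernel matrix onto the PSD cone.

    Eigendecompose, force negative eigenvalues to zero, and rebuild the
    matrix, as described for the indefinite kernels K1_Al - K3_Al."""
    w, v = np.linalg.eigh(K)
    return (v * np.maximum(w, 0.0)) @ v.T
```

The off-diagonal entries change under this projection, which is exactly why the paper also proposes K_4^{Al}, whose PSD-ness is guaranteed by construction while keeping off-diagonal terms intact.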

In general, the alignment kernels K_1^{Al} - K_6^{Al} show better performance than the other kernels. Also, among the alignment kernels, the positive semidefinite ones (K_4^{Al} - K_6^{Al}) show better performance than the non-positive semidefinite ones (K_1^{Al} - K_3^{Al}), as expected. Finally, the positive semidefinite alignment kernels (K_4^{Al} - K_6^{Al}) show better overall classification accuracy than CE, a state of the art structure comparison tool. Interestingly, CE shows much lower negative classification accuracy than many of the kernels. This can be attributed to the better generalization capabilities of SVMs over nearest neighbor classifiers. Thus, it is clear from the experiments that alignment based kernels outperform state of the art structure classification tools in terms of accuracy.

Kernels K_1 and K_2 take less than 5s, and K_3 and K_4 take less than 30s, for proteins having around 200 residues on an Athlon 2.2GHz based desktop computer. The alignment kernels take less than 1s for 200-residue proteins.

5. Conclusion

In this study, the main objective was to derive kernels on protein structures without using sequence information. Several kernels were derived from structural alignments and were benchmarked against CE on the difficult task of classifying proteins having low sequence similarity, with extremely encouraging results. Though the kernels were designed on protein structures, it is interesting to explore their applications to other 3D pointsets.

Acknowledgement: The authors are indebted to MHRD, Govt. of India, for supporting this research through grant number F26-11/2004.

References

Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N., & Bourne, P. E. (2000). The Protein Data Bank. Nucleic Acids Research, 28, 235-242.

Bertsimas, D., & Tsitsiklis, J. (1997). Introduction to linear optimization. Athena Scientific.

Bhattacharya, S., Bhattacharyya, C., & Chandra, N. (2006). Projections for fast protein structure retrieval. BMC Bioinformatics, 7(Suppl 5), S5.

Bourne, P. E., & Shindyalov, I. N. (1998). Protein structure alignment by incremental combinatorial extension of the optimal path. Protein Engineering, 11, 739-747.

Bourne, P. E., & Shindyalov, I. N. (2003). Protein structure comparison and alignment. In P. E. Bourne and H. Weissig (Eds.), Structural bioinformatics, 321-337. Wiley-Liss.

Eidhammer, I., Jonassen, I., & Taylor, W. R. (2000). Structure comparison and structure patterns. Journal of Computational Biology, 7, 685-716.

Haussler, D. (1999). Convolution kernels on discrete structures (Technical Report). University of California, Santa Cruz.

Holm, L., & Sander, C. (1993). Protein structure comparison by alignment of distance matrices. Journal of Molecular Biology, 233, 123-138.

Horn, B. K. P. (1987). Closed form solution of absolute orientation using unit quaternions. Journal of the Optical Society of America A, 4, 629-642.

Jaakkola, T., Diekhans, M., & Haussler, D. (1999). Using the Fisher kernel method to detect remote protein homologies. Proc. 7th Intell. Sys. Mol. Biol., 149-158.

Leslie, C., & Kwang, R. (2004). Fast string kernels using inexact matching for protein sequences. Journal of Machine Learning Research, 5, 1435-1455.

Murzin, A. G., Brenner, S. E., Hubbard, T., & Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247, 536-540.

Pardalos, P., Rendl, F., & Wolkowicz, H. (1994). The quadratic assignment problem: a survey and recent developments. In P. Pardalos and H. Wolkowicz (Eds.), Quadratic assignment and related problems (New Brunswick, NJ, 1993), 1-42. Providence, RI: Amer. Math. Soc.

Schölkopf, B., Weston, J., Eskin, E., Leslie, C. S., & Noble, W. S. (2002). A kernel approach for learning from almost orthogonal patterns. ECML (pp. 511-528).

Umeyama, S. (1988). An eigendecomposition approach to weighted graph matching problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10, 695-703.

Vert, J.-P., Saigo, H., & Akutsu, T. (2004). Local alignment kernels for biological sequences. In Kernel methods in computational biology, 131-154. MIT Press.

Wang, C., & Scott, S. D. (2005). New kernels for protein structural motif discovery and function classification. Proc. International Conference on Machine Learning.

Table 1. Comparison of results for pairwise protein structure comparison from different programs

Protein1 (length)   Protein2 (length)   CE RMSD (N)   DALI RMSD (N)   Spectral, full prot. RMSD (N)   Spectral, robust RMSD (N)
1DWT:A (152)        2MM1 (153)          0.7 (152)     0.8 (152)       0.749 (152)                     0.74 (152)
5CNA:A (237)        2PEL:A (232)        1.2 (115)     1.3 (117)       1.911 (223)                     1.33 (221)
2ACT (218)          1PPN (212)          1.0 (211)     0.9 (210)       0.877 (210)                     0.82 (211)
1HTI:A (248)        1TIM:A (247)        0.9 (246)     1.0 (247)       1.023 (247)                     0.86 (245)
2HCK:A (437)        d2hcka1 (63)        -             0.0 (63)        2.81 (34)                       0.0 (63)
2HCK:A (437)        d2hcka2 (103)       -             0.0 (103)       3.12 (58)                       0.0 (103)
2HCK:A (437)        d2hcka3 (271)       -             0.0 (271)       0.00 (271)                      0.0 (271)

Table 2. Positive and negative classification accuracies on 15 SCOP classes for various kernels and CE

SCOP      K1            K2            K3            K4            K1^Al          K2^Al
Classfn   TP%    TN%    TP%    TN%    TP%    TN%    TP%    TN%    TP%    TN%     TP%    TN%
a.4.5     77.00  66.00  62.00  75.00  55.00  85.00  82.00  87.00  94.00  89.00   91.00  97.00
c.66.1    63.00  45.00  72.00  38.00  81.00  49.00  55.00  54.00  83.00  100.00  97.00  100.00
b.18.1    92.00  78.00  89.00  81.00  74.00  50.00  98.00  69.00  87.00  92.00   90.00  82.00
c.52.1    67.00  52.00  72.00  50.00  58.00  47.00  47.00  49.00  35.00  95.00   43.00  80.00
c.37.1    51.00  36.00  58.00  36.00  35.00  47.00  52.00  44.00  33.00  86.00   42.00  78.00
b.29.1    100.00 87.00  87.00  71.00  82.00  51.00  80.00  81.00  70.00  98.00   79.00  95.00
c.108.1   66.00  45.00  74.00  50.00  70.00  40.00  81.00  59.00  65.00  100.00  79.00  99.00
c.47.1    63.00  44.00  35.00  38.00  52.00  63.00  54.00  63.00  81.00  90.00   81.00  82.00
b.1.18    72.00  65.00  70.00  75.00  49.00  75.00  63.00  81.00  93.00  49.00   91.00  85.00
g.39.1    59.00  45.00  54.00  86.00  60.00  97.00  76.00  95.00  100.00 15.00   96.00  57.00
b.40.4    65.00  72.00  77.00  76.00  53.00  87.00  68.00  83.00  87.00  58.00   90.00  83.00
c.55.3    62.00  30.00  65.00  39.00  68.00  42.00  60.00  56.00  60.00  99.00   68.00  96.00
c.55.1    57.00  38.00  58.00  47.00  33.00  50.00  54.00  35.00  58.00  97.00   65.00  85.00
c.2.1     68.00  45.00  65.00  64.00  42.00  41.00  60.00  47.00  84.00  93.00   87.00  91.00
a.39.1    83.00  75.00  93.00  94.00  30.00  0.00   30.00  11.00  85.00  91.00   88.00  87.00

Table 2 (contd.)

SCOP      K3^Al          K4^Al         K5^Al         K6^Al         CE
Classfn   TP%    TN%     TP%    TN%    TP%    TN%    TP%    TN%    TP%     TN%
a.4.5     93.00  93.00   83.00  91.00  82.00  90.00  90.00  92.00  93.00   60.00
c.66.1    81.00  100.00  99.00  86.00  90.00  85.00  100.00 90.00  99.00   45.00
b.18.1    94.00  78.00   80.00  88.00  82.00  87.00  98.00  84.00  100.00  77.00
c.52.1    31.00  82.00   60.00  62.00  65.00  64.00  76.00  65.00  86.00   54.00
c.37.1    27.00  89.00   70.00  50.00  66.00  53.00  76.00  67.00  93.00   41.00
b.29.1    66.00  96.00   84.00  66.00  86.00  69.00  97.00  82.00  97.00   74.00
c.108.1   69.00  100.00  81.00  69.00  84.00  66.00  92.00  80.00  100.00  69.00
c.47.1    68.00  89.00   64.00  71.00  61.00  68.00  72.00  71.00  98.00   58.00
b.1.18    96.00  45.00   81.00  80.00  84.00  81.00  85.00  82.00  99.00   78.00
g.39.1    100.00 37.00   90.00  91.00  84.00  89.00  94.00  86.00  85.00   85.00
b.40.4    92.00  61.00   75.00  78.00  67.00  78.00  80.00  78.00  98.00   70.00
c.55.3    59.00  98.00   89.00  66.00  80.00  67.00  98.00  84.00  100.00  56.00
c.55.1    57.00  91.00   89.00  69.00  91.00  69.00  85.00  72.00  99.00   52.00
c.2.1     92.00  98.00   100.00 89.00  95.00  87.00  95.00  78.00  100.00  57.00
a.39.1    83.00  83.00   88.00  83.00  93.00  88.00  90.00  92.00  100.00  74.00