Structural Alignment Based Kernels for Protein Structure Classification
Total Page:16
File Type:pdf, Size:1020Kb
Structural Alignment based Kernels for Protein Structure Classification Sourangshu Bhattacharya [email protected] Chiranjib Bhattacharyya [email protected] Dept. of Computer Science and Automation, Indian Institute of Science, Bangalore - 560 012, India. Nagasuma Chandra [email protected] Bioinformatics Center, Indian Institute of Science, Bangalore - 560012, India. Abstract acid types. This view is taken by most structure align- Structural alignments are the most widely ment algorithms (Eidhammer et al., 2000), which are used tools for comparing proteins with low the most widely accepted methods for protein struc- sequence similarity. The main contribution ture comparison. of this paper is to derive various kernels on In this paper, we explore the idea of designing clas- proteins from structural alignments, which sifiers based on Support vector machines(SVMs) for do not use sequence information. Central proteins having low sequence similarity. More explic- to the kernels is a novel alignment algorithm itly our goal is to derive positive semi-definite ker- which matches substructures of fixed size us- nels motivated from structural alignments without us- ing spectral graph matching techniques. We ing sequence information. The hypothesis is that any derive positive semi-definite kernels which structure classification technique should capture the capture the notion of similarity between sub- notion of similarity defined by structural alignment structures. Using these as base more sophis- algorithms. Though there is lot of work on design- ticated kernels on protein structures are pro- ing kernels on protein sequences (Jaakkola et al., 1999; posed. To empirically evaluate the kernels we J.-P. Vert, 2004; Leslie & Kwang, 2004), to the best used a 40% sequence non-redundant struc- of our knowledge, there is relatively less work on ker- tures from 15 different SCOP superfamilies. nels on protein structures. (Wang & Scott, 2005) is an The kernels when used with SVMs show com- interesting first step, where both sequence and struc- petitive performance with CE, a state of the ture information are used, to define kernels on protein art structure comparison program. structures. First, we propose a novel protein structure alignment 1. Introduction algorithm by using a spectral projection based sim- ilarity measure. We show that naive spectral graph Classification of proteins into different classes of inter- matching (Umeyama, 1988) based techniques fail to est, is a problem of fundamental interest in computa- produce the correct alignment in many cases due to tional biology. Powerful techniques exist for classify- existence of unmatched residues, called indels, in many ing proteins having high sequence similarity. However, similar structures. This drawback is remedied by con- these methods are not reliable when sequence similar- sidering pairs of substructures from each protein. ity falls in the twilight zone (Bourne & Shindyalov, The main contribution of this paper is to propose ker- 2003), i.e. is in the range 20 − 30%. Developing clas- nels on protein structures based on structural align- sification tools for such proteins is an important re- ments. Following the idea of sub-structure matching, search issue. Since sequence information is unreliable we propose novel kernels on protein sub-structures us- it is interesting to think about classification tools based ing the spectral projection vector and pairwise dis- on structures alone, and deliberately ignore the amino tances, and show that a limiting case of the spec- Appearing in Proceedings of the 24 th International Confer- tral kernel relates to the alignment score on sub- ence on Machine Learning, Corvallis, OR, 2007. Copyright structures. Combining these kernels on sub-structures, 2007 by the author(s)/owner(s). we define various kernels on protein structures. Using Alignment Kernels for Protein Structures substructure kernels, we also define kernels on pro- also called the distance matrix of a protein structure. tein structures that take any pairwise structural align- The drawback of preferring smaller lengths was reme- ment, explicitly into account. Benchmarking exper- died in formulation proposed in DALI (Holm & iments are conducted on a sequence non-redundant Sander, 1993). The score function used in DALI is SCOP dataset. We benchmark our results against A B 2 |dij − dφ(i)φ(j)| d¯lk CE (Bourne & Shindyalov, 1998), a state of the art SDALI (φ) = 0.2 − exp − d¯ij 20 pA,pA P¯A ! „ « ! structure comparison program on a dataset with low i Xj ∈ sequence similarity. Extensive experiments show that ¯ A B the results are promising. where dij = (dij + dφ(i)φ(j))/2. DALI stochastically searches over all alignments for the φ which maximizes The paper is organized as follows. Section 2 de- S . Note that, residues which are spatially far do scribes development of the structure alignment algo- DALI not contribute much to the DALI score. This is due rithm. Section 3 develops the kernels for over sub- to the exponentially decreasing function of the aver- structures and subsequently over protein structures. age distance used to weight each term. Following this Section 4 describes the experimental results. observation, we define the adjacency matrix of a pro- −dij tein as Aij = e α , α > 0. This is an exponentially 2. Protein Structure Alignment decreasing function between 0 and 1, making the en- 2.1. Background tries corresponding to far away residues very close to zero. The problem of finding optimal correspondences The tertiary structure of a protein is represented by can be viewed as an weighted graph matching prob- the 3 dimensional coordinates of all atoms present lem (Umeyama, 1988) with weights given by the adja- in a given protein chain, with respect to an arbi- cency matrix values. In the next section, we propose trary coordinate system. However, following common an alternate formulation using spectral graph theoretic practice, we consider only Cα atoms of each residue. techniques. Thus, we represent protein structures as a set of points corresponding to residues in 3 dimensions. So, a 2.2. Spectral Solution to Protein Structure protein P is represented as P = {p1,...,pn} where Alignment 3 pi ∈ R , 1 ≤ i ≤ n. A structural alignment between A B A Consider the ideal situation where both protein P A two proteins P and P is a 1-1 mapping φ : {i|pi ∈ ¯A B ¯B ¯A A ¯B B and P B have equal number of residues, all of which P } → {j|pj ∈ P }, where P ⊆ P and P ⊆ P . The mapping φ defines a set of correspondences be- have a corresponding residue in the other structure. tween the residues of the two proteins. |P¯A| = |P¯B| In such a case, the problem of finding optimal align- is called the length of the alignment. Given a struc- ment between the two proteins is same as finding the tural alignment φ between 2 structures P A and P B, optimal permutation of the residues of one of the pro- A B and a transformation T of structure B onto A, a teins. Let A and A be the adjacency matrices popular measure of the goodness of superposition is corresponding to the two proteins. Consider their A n A A A T the root mean square deviation (RMSD), defined as eigenvalue decompositions A = i=1 λi ζi (ζi ) B n B B B T P 1 A B 2 and A = i=1 λi ζi (ζi ) , with terms on the RHS RMSD(φ) = ¯A pA∈P¯A (pi −T (pφ i )) Given P q |P | P i ( ) sorted according to decreasing value of λ. It is easy to an alignment φ, the optimal transformation T , mini- A B see that the eigenvectors ζi and ζi will be related by mizing RMSD, can be computed in closed form using the same permutation as the residues of the original the method described in (Horn, 1987). proteins. Thus, the correct alignment can be retrieved However, using RMSD as a measure for evaluat- by calculating the permutation that best matches the A B ing alignments has 2 problems: the optimal trans- entries of ζi and ζi . formation T needs to be computed for every align- The adjacency matrix A of any protein can be ap- ment, and RMSD favors alignments of lower lengths proximated by k eigenvectors with an error η, given which may not capture the total similarity between k T 2 n 2 2 2 by: kA − i=1 λiζiζi kF = j=k+1 λj = η kAkF . 2 proteins in presence of noise. An alternate mea- For k = 1,Pη is minimized whenP the eigenvector cor- sure, called the distance RMSD avoids the calcu- responding to the largest eigenvalue, ζ1, is considered. lation of optimal transformation. Distance RMSD th The j component of the leading eigenvector, ζ1(j) for an alignment φ is defined as RMSDD(φ) = can be thought of as a projection of the corresponding 1 A B 2 A ¯A 2 pA,pA∈P¯A (dij − d ) where, dij is the residue on to the real line. It was also observed that q |P | P i j φ(i)φ(j) A A spatially close residues have similar projected values; distance between residues pi and pj . The matrix d is i.e. if Ajl is high, then |ζ1(j) − ζ1(l)| is low. Hence, ζ1 Alignment Kernels for Protein Structures can be thought of as projections that preserve neigh- 2. For each pair of sub-structures, one from each borhood (Bhattacharya et al., 2006). protein, compute the alignment between the sub- structures by solving the problem described in Let ζA and ζB be the eigenvectors corresponding to 1 1 equations 1 and 2. the largest eigenvalues of the adjacency matrices of A B proteins P and P . Define spectral projection vector 3. For each sub-structure alignment computed f, to be a vector of absolute values of components of above, compute the optimal transformation of the A A B B A B ζ.