LNA: Fast Protein Classification Using a Laplacian Characterization of Tertiary Structure Nicolas Bonnel, Pierre-François Marteau
Total Page:16
File Type:pdf, Size:1020Kb
LNA: Fast Protein Classification Using A Laplacian Characterization of Tertiary Structure Nicolas Bonnel, Pierre-François Marteau To cite this version: Nicolas Bonnel, Pierre-François Marteau. LNA: Fast Protein Classification Using A Laplacian Characterization of Tertiary Structure. IEEE/ACM Transactions on Computational Biology and Bioinformatics, Institute of Electrical and Electronics Engineers, 2012, 9 Issue: 5, pp.1451 - 1458. 10.1109/TCBB.2012.64. hal-00639663 HAL Id: hal-00639663 https://hal.archives-ouvertes.fr/hal-00639663 Submitted on 9 Nov 2011 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. LNA: Fast Protein Classification Using A Laplacian Characterization of Tertiary Structure Nicolas Bonnel and Pierre-Francois Marteau IRISA-UBS, Universit´ede Bretagne Sud, France [email protected], [email protected] November 9, 2011 Abstract Proteins having similar functional properties may have a very different primary structure, but they usu- 1 Motivation: ally share similar tertiary structures. During the past two decades, a lot of tertiary structures have been In the last two decades, a lot of protein 3D shapes discovered, and 3D fold protein data banks such as have been discovered, characterized and made avail- PDB [4] are quickly growing (PDB holds 79000 pro- able thanks to the Protein Data Bank (PDB), that tein files as of November 2011). It is therefore crucial is nevertheless growing very quickly. New scalable to be able to efficiently search through this database. methods are thus urgently required to search through Best methods designed to compare or match pro- the PDB efficiently. tein 3D structures involve heavy computations. [16] compares Cα distance matrices, [30] extendedly com- bines aligned fragment pairs based on local geometry, 2 Results: [18] compares lists of feature of primary, secondary and tertiary structures, [34, 28] maximize the TM- We present in this paper an approach entitled LNA score which is independent of protein lengths con- (Laplacian Norm Alignment) that performs struc- trary to the Root Mean Square Deviation (RMSD: tural comparison of two proteins with dynamic pro- [19]). [35] compares intra-molecular residue-residue gramming algorithms. This is achieved by charac- relationship. [14, 22] first align secondary struc- terizing each residue in the protein with scalar fea- ture elements (SSE) and then refine the alignment tures. The feature values are calculated using a to residue level. Some approaches are based on con- Laplacian operator applied on the graph correspond- tact maps matching which is though to be NP-hard ing to the adjacency matrix of the residues. The as descibed by [15]: [6, 3, 12]. [27] uses double dy- weighted Laplacian operator we use estimates at var- namic programming on vector of Cβ distances and is ious scales local deformations of the topology where used to build the CATH [9] classification. each residue is located. On some benchmarks widely On the other hand, a few methods are a lot faster, shared by the community we obtain qualitatively sim- nevertheless at the cost of results quality. [33] en- ilar results compared to other competing approaches, hances BLAST [2] by using a structural alphabet and but with an algorithm one or two order of magni- corresponding substitution matrix. [23, 24] enhance tudes faster. 180,000 protein comparisons can be BLAST using an alphabet corresponding to residue done within 1 seconds with a single recent GPU, positions in the Ramachandran plot [29]. [7] char- which makes our algorithm very scalable and suitable acterizes proteins as sequences of α angles (dihedral for real-time database querying across the Web. angle between four consecutive α carbons) and com- pares them by searching common fixed length pat- 3 Introduction terns. [21] performs a global alignment of structural signatures. Proteins are made of a sequence of amino acids We present in this paper a novel way for describing (residues) that folds in a 3 dimensional structure. protein 3D structures. While Cartesian coordinates 1 of a residue are dependent upon a reference coordi- rotation. nate system and only give information about its loca- The tertiary structure of a protein is usually en- tion, differential coordinates carry information about coded as sequence of 3D Cartesian space coordinates the local topology and the orientation of local struc- for each atom of each amino acid. We simplify this tural details. We use the norm of Laplacian coordi- sequence by only keeping carbon alpha (Cα) 3D po- nates of residues at different scales to encode a se- sitions for each residue. quence of multivariate local deformation descriptors. The norm of the Laplacian coordinates presents the 4.1 Computation of the Laplacian co- great advantage to be invariant to translation and ordinates of each residue rotation. The Laplacian operator have been used in various A protein P of length n is defined by a sequence of 3D P methods, mainly for 3D computer vision and 3D mod- coordinates p1,p2,...,pn. Let Ω (σ) be the weighted eling. [25] performs articulated shape matching. [11] adjacency matrix of an undirected graph with one P uses this operator to smoothen, enhance or check the node per residue for the protein P . The weight Ωij(σ) quality of triangular meshes. of each edge eij for 1 ≤ i,j ≤ n is computed using a Dynamic programming algorithms can then be Gaussian kernel: used to match these multivariate descriptor sequences and achieve similar classification or clustering results 2 −kpi−pj k compared to other approaches. These dynamic pro- P e σ2 if |i − j| > 1 2 Ωij(σ)= gramming algorithms in O(n ) time complexity re- 0 otherwise and in this case there is no edge quire few memory and can be efficiently implemented (1) on Graphical Processing Unit (GPU) for an even Let the diagonal matrix DP (σ) be defined as: faster computation time. This allows to compare one protein structure against 180,000 others within a sec- P P D (σ)= Ω (σ) (2) ond time span. ii ij j In the following, we describe how we compute the X Laplacian coordinates of residues and their norm. We The discrete Laplace operator LP (σ) of the protein then present two dynamic programming algorithms P is given by: derived from [26] and [31] for global and local align- −1 ment respectively and compare our approach with LP (σ)= I − DP (σ) ΩP (σ) (3) state of the art methods on two protein datasets known to be difficult to categorize. We show that By applying this operator to the vector of the our approach outperforms the fastest know methods residue 3D Cartesian positions, we obtain a vector of and provides qualitatively similar results than most 3D Laplacian coordinates for each residue. Finally, of the best known approaches, with algorithms one we build a sequence P (σ) = p1(σ), p2(σ),..., pn(σ) or two order of magnitude faster. of Euclidean norms of the Laplacian coordinates for each residue, as showne on figuree 1. e e Since we use a Gaussian kernel, the norm of the 4 Method and algorithm Laplacian coordinates is usually higher in the periph- ery of the protein than in the center, as shown on Our method, called Laplacian Norm Alignment figure 2(b). Hence, beta sheets generate a lowly de- (LNA), characterizes proteins as a multi-dimensional formed surface, the norm of Laplacian coordinates of sequence of quantity of deformation at different scales their residues is quite low. A contrario coils can have of the local space in which residues are embedded. a high norm of Laplacian coordinates corresponding More precisely, the discrete Laplace operator (dis- to their residues if they are found in the periphery crete Laplacian) is used to measure the divergence of the protein. Alpha helices rarely go through the of the gradient of residue positions in a graph that center of the protein: for each alpha helix, a ’side’ describes a protein weighted adjacency map. Lapla- is near the center of the protein and the other near cian coordinates of residues are resilient to translation the periphery. Consequently, consecutive residues in but not to rotation. This limitation is overcome by the chain have a norm of their Laplacian coordinates only keeping for each residue the Euclidean norm of that vary: thus, alpha helices correspond to a saw- its Laplacian coordinates that is indeed invariant to tooth shape as shown on figure 2(a). 2 (0,2) 2 norms of Laplacian coordinates: one corresponding 2 to coordinates computed with σ < 10A˚ (local de- (2,1) (-2,1) 0.64 scription) and the other with σ > 20A˚ (more global 1 0.17 3 description). 0.57 (0,0) Hence a protein P of length n can be characterized 4 by a sequence P = p1, p2,..., pn of k-dimensional 0.15 0.33 0.14 vectors defined as follows: e e e e pi =[pi(σ1), pi(σ2),..., pi(σk)], 1 ≤ i ≤ n (4) 5 (-1,-2) wheree σe1,σ2,...,σe k are alle in [2; 50]. (a) Graph ΩP (3) with its weights (solid lines). 4.3 Sequence comparison 1 0 −0.16 −0.53 −0.31 We propose 2 dynamic programming approaches for 0 1 0 −0.81 −0.19 global and local comparison of proteins. Given two −0.55 0 1 0 −0.45 sequences P and Q of respective length m and n, both −0.47 −0.53 0 1 0 approaches perform computations in O(m.n) time, −0.53 −0.24 −0.23 0 1 and use k-dimensionale e vectors for the characteriza- (b) Laplacian operator LP (3) tion of residues.