Advances and Pitfalls of Protein Structural Alignment
Hitomi Hasegawa¹ and Liisa Holm¹,²
Available online at www.sciencedirect.com

Structure comparison opens a window into the distant past of protein evolution, which has been unreachable by sequence comparison alone. With 55 000 entries in the Protein Data Bank and about 500 new structures added each week, automated processing, comparison, and classification are necessary. A variety of methods use different representations, scoring functions, and optimization algorithms, and they generate contradictory results even for moderately distant structures. Sequence mutations, insertions, and deletions are accommodated by plastic deformations of the common core, retaining the precise geometry of the active site, and peripheral regions may refold completely. Therefore structure comparison methods that allow for flexibility and plasticity generate the most biologically meaningful alignments. Active research directions include both the search for fold invariant features and the modeling of structural transitions in evolution. Advances have been made in algorithmic robustness, multiple alignment, and speeding up database searches.

Addresses
1 Institute of Biotechnology, University of Helsinki, P.O. Box 56 (Viikinkaari 5), 00014 University of Helsinki, Finland
2 Department of Biological and Environmental Sciences, University of Helsinki, P.O. Box 56 (Viikinkaari 5), 00014 University of Helsinki, Finland

Corresponding author: Holm, Liisa (liisa.holm@helsinki.fi)

Current Opinion in Structural Biology 2009, 19:341–348
This review comes from a themed issue on Sequences and Topology
Edited by Anna Tramontano and Adam Godzik
Available online 27th May 2009
0959-440X/$ – see front matter
© 2009 Elsevier Ltd. All rights reserved.
DOI 10.1016/j.sbi.2009.04.003

Introduction

Comparative analyses of protein sequences and structures play a fundamental role in understanding proteins and their functions. The total variety of protein structures is considerably smaller than the variety of protein sequences. This is commonly explained as the result of physical limitations and the evolutionary history of natural proteins. Assuming an evolutionary continuity of structure and function, describing the structural similarity relationships between protein structures allows scientists to infer the functions of newly discovered proteins.

The foundations for structure comparison were laid down by visual analysis, when all known structures could be illustrated in a review [1]. Visual recognition of recurrent folding patterns between unrelated proteins suggested that there might exist a finite number of admissible folds. Hierarchical taxonomies of protein structures were proposed wherein clusters of homologous proteins are nested within folds, and every fold is slotted into one of four classes. Automated clusterings were generated, which were in broad general agreement with manual and semi-automated classifications. Lately, the hierarchical view with discrete folds has given way to a view where related structures form elongated clusters in fold space and folds can be morphed continuously one into another [2,3].

There is no universally acknowledged definition of what constitutes structural similarity but we all know it when we see it. There is a strong tradition to visualize structural alignments by least-squares superimposition, which treats the structures as rigid 3D objects. Other representations, such as distance difference matrices, carry detailed information about internal motions (Figure 1a). The spectrum of structural alignment methods includes rigid, flexible, and elastic aligners, which differ in their treatment of structural variations. At the structural level, mutations manifest in plastic deformations, shifts, and rotations of the secondary structure elements (SSEs). For example in the globin family, their cumulative effect reaches 7 Å and 60° [4]. Structural alignments are typically local to distinguish between the common core and regions where one-to-one structural correspondences cannot be established.

The most widespread purpose of structural alignment has been to identify homologous residues (encoded by the same codon in the genome of a common ancestor). In sequence comparison, this is typically achieved by modeling amino acid substitution as well as insertion and deletion probabilities [5]. A few quantitative, empirically parameterized models of structural evolution have been proposed [6,7–9,10], but most structural aligners are based on ad hoc scores of structural similarity. Nevertheless, some ad hoc scores have been shown to work in practice indicating that they, too, capture qualitative aspects of structural evolution [11,12].

Protein structural alignment is an active field of research. The number of new methods published per year has been doubling every 5 years for the last 30 years; this estimate is based on a sample of published papers in ISI Web of Knowledge, which have structural alignment or structure comparison in the title. The methods use different representations, scoring functions, and optimization algorithms. Ranking methods with different views on similarity is not straightforward, and some evaluations have produced paradoxical results. Here, we review methodologies for structure comparison and the comparison of methods.

[Figure 1]

Scores

Different scoring schemes can be classified into types depending on whether the structural representation is three-dimensional, two-dimensional or one-dimensional or one number characterizing the whole structure. A selection of functional forms used in structural similarity scores is collated in Table 1.

3D. The similarity of 3D objects can be quantified by the positional deviations of equivalent atoms upon rigid-body superimposition. Numerous scoring functions have been proposed. Depending on the balance between the size of the common core (Ne), positional deviations (rmsd), and gap penalties, these criteria can define different sets of optimal correspondences (Figure 1a). Instead of a single rigid-body transformation, flexible aligners chain together a series of substructures, which have tight local superimpositions. As a result, flexible aligners can identify similarities between proteins with large conformational changes. Flexible aligners have proliferated in recent years [6,13,14,15,16].

2D. Structurally equivalent residues have similar tertiary interactions. The residue–residue interactions are described, for example, by contact maps, graphs, or distance matrices. Contact maps can be generated by topological analysis of protein structures. Delaunay and Voronoi tessellations create a mesh grid inside the protein structure such that every point of space is assigned to the nearest residue [17,18,19]. Residues, which are spatial neighbors, share a facet, edge, or vertex of a mesh cell. Using these topological criteria, there is no need to specify a distance threshold of contacting residues. Conserved neighbor relationships have been proposed as an objective definition of the common core [18]. These methods allow flexibility as long as interaction networks are conserved. Dali is an elastic aligner. Similarity is defined proportional to the relative distance differences of intramolecular C(alpha)–C(alpha) [...] deviations are permitted for tertiary contacts (e.g. between helices or beta-sheets) than the local ones (e.g. between backbone conformations).

1D. Structural profiles are a very popular approach among computer scientists. These profiles classify each residue according to its amino acid type and discrete backbone conformational state. Fast string algorithms, as used for amino acid sequences, are directly applicable to the structural alphabets [9,12,19–23]. However, these profiles have limited power to detect structural similarity between proteins that have large embellishments on top of the common core.

0D. The ultimate reduction projects an entire structure to its fingerprint, that is, one number or histogram [24,25,26–28]. Index lookups allow the fastest possible database searches. Similar structures should generate identical (or similar) fingerprints, but substructure matching is problematic. For example, Tableausearch [25] is based on the hypothesis that folds have invariant, discrete patterns of secondary structure arrangements. Fingerprints are also stored for all substructures generated by deleting one or two SSEs. Unfortunately, index based search of approximate matches to more distant structures is prohibitive because of the combinatorial explosion.

Alignment

Once a structural similarity score has been chosen, the alignment problem consists of finding the optimal set of correspondences. The evaluation of similarities over all interactions leads to sum-of-pairs problems which are NP-hard. In contrast, rmsd-based scores can be optimized in polynomial time [29]. All practical algorithms for structural alignment employ heuristics. The most commonly used approaches are an educated guess of the rigid-body transformation (superimposition of un-gapped segments or internal coordinate frames), fragment assembly including graph extension algorithms (combinatorial problems), and double dynamic programing (yielding an approximate solution to sum-of-pairs problems [30]). Many programs mix several approaches.
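
As a small illustration of the distance-matrix (2D) representation discussed under Scores, the following Python sketch compares intramolecular distances between two structures without any superimposition. It is only a sketch under stated assumptions: the function names, the toy coordinates, and the pre-computed list of equivalent residue pairs are all hypothetical, and the relative difference shown here (difference divided by the mean of the two distances) is a generic quantity, not the parameterized score of Dali or of any other program named in the text.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between points (e.g. C-alpha atoms)."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def distance_difference_matrix(coords_a, coords_b, pairs):
    """Absolute differences |dA - dB| over all pairs of aligned positions.

    pairs[k] = (i, j) means residue i of structure A is taken as equivalent
    to residue j of structure B (a hypothetical, pre-computed alignment).
    """
    da = distance_matrix(coords_a)
    db = distance_matrix(coords_b)
    return [[abs(da[i1][i2] - db[j1][j2]) for (i2, j2) in pairs]
            for (i1, j1) in pairs]


def relative_difference_matrix(coords_a, coords_b, pairs):
    """Distance differences divided by the mean of the two distances.

    Entries whose mean distance is zero (the diagonal) are set to 0.0.
    """
    da = distance_matrix(coords_a)
    db = distance_matrix(coords_b)
    result = []
    for i1, j1 in pairs:
        row = []
        for i2, j2 in pairs:
            mean = 0.5 * (da[i1][i2] + db[j1][j2])
            diff = abs(da[i1][i2] - db[j1][j2])
            row.append(diff / mean if mean > 0 else 0.0)
        result.append(row)
    return result


def rotate_z90_and_shift(point, shift=(10.0, 5.0, 2.0)):
    """Rigid-body motion: rotate 90 degrees about the z axis, then translate."""
    x, y, z = point
    return (-y + shift[0], x + shift[1], z + shift[2])


# Toy coordinates (not taken from any real protein).
A = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
B_RIGID = [rotate_z90_and_shift(p) for p in A]   # same shape, moved rigidly
B_HINGED = A[:3] + [(0.0, 3.8, 3.0)]             # last point displaced
PAIRS = [(k, k) for k in range(len(A))]          # assumed equivalences

if __name__ == "__main__":
    for name, b in (("rigid", B_RIGID), ("hinged", B_HINGED)):
        ddm = distance_difference_matrix(A, b, PAIRS)
        print(name, "max |dA - dB| =", round(max(max(r) for r in ddm), 3))
```

A rigid-body motion leaves every intramolecular distance unchanged, so the difference matrix is zero up to rounding for `B_RIGID`, whereas for `B_HINGED` the nonzero entries are confined to the row and column of the displaced point. This is the sense in which such matrices localize internal motions that a single global superimposition would smear over the whole structure.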