Available online at www.sciencedirect.com

Advances and pitfalls of protein structural alignment
Hitomi Hasegawa1 and Liisa Holm1,2

Structure comparison opens a window into the distant past of protein evolution, which has been unreachable by sequence comparison alone. With 55 000 entries in the Protein Data Bank and about 500 new structures added each week, automated processing, comparison, and classification are necessary. A variety of methods use different representations, scoring functions, and optimization algorithms, and they generate contradictory results even for moderately distant structures. Sequence mutations, insertions, and deletions are accommodated by plastic deformations of the common core, retaining the precise geometry of the active site, and peripheral regions may refold completely. Therefore structure comparison methods that allow for flexibility and plasticity generate the most biologically meaningful alignments. Active research directions include both the search for fold invariant features and the modeling of structural transitions in evolution. Advances have been made in algorithmic robustness, multiple alignment, and speeding up database searches.

Addresses
1 Institute of Biotechnology, University of Helsinki, P.O. Box 56 (Viikinkaari 5), 00014 University of Helsinki, Finland
2 Department of Biological and Environmental Sciences, University of Helsinki, P.O. Box 56 (Viikinkaari 5), 00014 University of Helsinki, Finland

Corresponding author: Holm, Liisa (liisa.holm@helsinki.fi)

Current Opinion in Structural Biology 2009, 19:341–348
This review comes from a themed issue on Sequences and Topology
Edited by Anna Tramontano and Adam Godzik
Available online 27th May 2009
0959-440X/$ – see front matter
© 2009 Elsevier Ltd. All rights reserved.
DOI 10.1016/j.sbi.2009.04.003

Introduction
Comparative analyses of protein sequences and structures play a fundamental role in understanding proteins and their functions. The total variety of protein structures is considerably smaller than the variety of protein sequences. This is commonly explained as the result of physical limitations and the evolutionary history of natural proteins. Assuming an evolutionary continuity of structure and function, describing the structural similarity relationships between protein structures allows scientists to infer the functions of newly discovered proteins.

The foundations for structure comparison were laid down by visual analysis, when all known structures could be illustrated in a review [1]. Visual recognition of recurrent folding patterns between unrelated proteins suggested that there might exist a finite number of admissible folds. Hierarchical taxonomies of protein structures were proposed wherein clusters of homologous proteins are nested within folds, and every fold is slotted into one of four classes. Automated clusterings were generated, which were in broad general agreement with manual and semi-automated classifications. Lately, the hierarchical view with discrete folds has given way to a view where related structures form elongated clusters in fold space and folds can be morphed continuously one into another [2,3].

There is no universally acknowledged definition of what constitutes structural similarity but we all know it when we see it. There is a strong tradition to visualize structural alignments by least-squares superimposition, which treats the structures as rigid 3D objects. Other representations, such as distance difference matrices, carry detailed information about internal motions (Figure 1a). The spectrum of structural alignment methods includes rigid, flexible, and elastic aligners, which differ in their treatment of structural variations. At the structural level, mutations manifest in plastic deformations, shifts, and rotations of the secondary structure elements (SSEs). For example in the globin family, their cumulative effect reaches 7 Å and 60° [4]. Structural alignments are typically local to distinguish between the common core and regions where one-to-one structural correspondences cannot be established.

The most widespread purpose of structural alignment has been to identify homologous residues (encoded by the same codon in the genome of a common ancestor). In sequence comparison, this is typically achieved by modeling substitution as well as insertion and deletion probabilities [5]. A few quantitative, empirically parameterized models of structural evolution have been proposed [6,7–9,10], but most structural aligners are based on ad hoc scores of structural similarity. Nevertheless, some ad hoc scores have been shown to work in practice indicating that they, too, capture qualitative aspects of structural evolution [11,12].

Protein structural alignment is an active field of research. The number of new methods published per year has been doubling every 5 years for the last 30 years; this estimate is based on a sample of published papers in ISI Web of Knowledge, which have structural alignment or structure comparison in the title. The methods use different representations, scoring functions, and optimization algorithms. Ranking methods with different views on

Figure 1


similarity is not straightforward, and some evaluations have produced paradoxical results. Here, we review methodologies for structure comparison and the comparison of methods.

Scores
Different scoring schemes can be classified into types depending on whether the structural representation is three-dimensional, two-dimensional or one-dimensional or one number characterizing the whole structure. A selection of functional forms used in structural similarity scores is collated in Table 1.

3D. The similarity of 3D objects can be quantified by the positional deviations of equivalent atoms upon rigid-body superimposition. Numerous scoring functions have been proposed. Depending on the balance between the size of the common core (Ne), positional deviations (rmsd), and gap penalties, these criteria can define different sets of optimal correspondences (Figure 1a). Instead of a single rigid-body transformation, flexible aligners chain together a series of substructures, which have tight local superimpositions. As a result, flexible aligners can identify similarities between proteins with large conformational changes. Flexible aligners have proliferated in recent years [6,13,14,15,16].

2D. Structurally equivalent residues have similar tertiary interactions. The residue–residue interactions are described, for example, by contact maps, graphs, or distance matrices. Contact maps can be generated by topological analysis of protein structures. Delaunay and Voronoi tessellations create a mesh grid inside the protein structure such that every point of space is assigned to the nearest residue [17,18,19]. Residues, which are spatial neighbors, share a facet, edge, or vertex of a mesh cell. Using these topological criteria, there is no need to specify a distance threshold of contacting residues. Conserved neighbor relationships have been proposed as an objective definition of the common core [18]. These methods allow flexibility as long as interaction networks are conserved. Dali is an elastic aligner. Similarity is defined proportional to the relative distance differences of intramolecular C(alpha)–C(alpha) distances. This means that, for the same level of similarity, larger absolute deviations are permitted for tertiary contacts (e.g. between helices or beta-sheets) than the local ones (e.g. between backbone conformations).

1D. Structural profiles are a very popular approach among computer scientists. These profiles classify each residue according to its amino acid type and discrete backbone conformational state. Fast string algorithms, as used for amino acid sequences, are directly applicable to the structural alphabets [9,12,19–23]. However, these profiles have limited power to detect structural similarity between proteins that have large embellishments on top of the common core.

0D. The ultimate reduction projects an entire structure to its fingerprint, that is, one number or histogram [24,25,26–28]. Index lookups allow the fastest possible database searches. Similar structures should generate identical (or similar) fingerprints, but substructure matching is problematic. For example, Tableausearch [25] is based on the hypothesis that folds have invariant, discrete patterns of secondary structure arrangements. Fingerprints are also stored for all substructures generated by deleting one or two SSEs. Unfortunately, index based search of approximate matches to more distant structures is prohibitive because of the combinatorial explosion.

Alignment
Once a structural similarity score has been chosen, the alignment problem consists of finding the optimal set of correspondences. The evaluation of similarities over all interactions leads to sum-of-pairs problems which are NP-hard. In contrast, rmsd-based scores can be optimized in polynomial time [29]. All practical algorithms for structural alignment employ heuristics. The most commonly used approaches are an educated guess of the rigid-body transformation (superimposition of un-gapped segments or internal coordinate frames), fragment assembly including graph extension algorithms (combinatorial problems), and double dynamic programing (yielding an approximate solution to sum-of-pairs problems [30]). Many programs mix several approaches.

A number of new robust optimization algorithms have been introduced for various problem settings including

(Figure 1 Legend) (a) Schematic illustration of rigid, flexible, and elastic structure comparison. The lower, lighter structure is related to the upper, darker structure by a translation (by d) of the right-side circular set of points relative to the left-side circles. Top right: rigid superimposition has to balance between lower rmsd and larger set of equivalent points (Ne). In this case, one can obtain a perfect superimposition of five points, or a worse superimposition of all eight points. The gray ellipses denote the equivalent pairs. Bottom left: flexible superimposition breaks the structure up into several rigid substructures and applies a different rigid-body transformation to each substructure. Link penalties control fragmentation. Bottom right: distance matrices and contact maps are representations of protein structure, which are independent of the coordinate frame. Distance difference matrices identify both structural conservation and motion between substructures. (b) The main purpose of structural alignment is to identify homologous residues. Urease and adenosine deaminase are members of a large superfamily with a common active site. Top: ribbon diagrams of superimposed structures (urease is white and adenosine deaminase is gray). Note extensive embellishments. Bottom: highlighting the conserved metal-binding and catalytic residues in the structural superimposition (by DaliLite server, with Ne = 210 equivalent C(alpha) atoms, rmsd = 3.5 Å and 13% sequence identity). The highlighted residues are His137, His139, His249, His275, and Asp363 in urease (PDB entry 1ie7, chain C; bottom left) and His391, His393, His659, His681, and Asp736 in adenosine deaminase (PDB entry 2a3l, chain A; bottom right).
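The scenario of panel (a) can be reproduced numerically. The following is a minimal sketch, not taken from any of the cited programs: the toy coordinates, the displacement value, and the function names are assumptions made for illustration. It builds two rigid groups of points, shifts one group by d, and computes the distance difference matrix D(i,j), which stays at zero within each rigid group and becomes nonzero only between the groups.

```python
import math

def distance_matrix(points):
    """Pairwise Euclidean distances between points (2D or 3D tuples)."""
    return [[math.dist(p, q) for q in points] for p in points]

def distance_difference_matrix(a, b):
    """D(i,j) = | ||a_i - a_j|| - ||b_i - b_j|| | for equivalent points a_i, b_i."""
    da, db = distance_matrix(a), distance_matrix(b)
    n = len(a)
    return [[abs(da[i][j] - db[i][j]) for j in range(n)] for i in range(n)]

# Hypothetical toy structure: two rigid groups of four points each.
left = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
right = [(4.0, 0.0), (5.0, 0.0), (4.0, 1.0), (5.0, 1.0)]
d = 2.0  # assumed displacement of the right-side group

upper = left + right
lower = left + [(x + d, y) for x, y in right]

ddm = distance_difference_matrix(upper, lower)
n_left, n_all = len(left), len(upper)

# Largest difference inside a rigid group (conserved substructure).
within = max(ddm[i][j] for i in range(n_all) for j in range(n_all)
             if (i < n_left) == (j < n_left))
# Largest difference between the two groups (relative motion).
between = max(ddm[i][j] for i in range(n_left) for j in range(n_left, n_all))

print("max difference within groups: ", within)
print("max difference between groups:", between)
```

No superimposition is needed for this calculation, which is what makes the representation independent of the coordinate frame.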

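Several of the 3D functional forms collated in Table 1 depend only on the list of per-residue deviations d_i after superimposition. The sketch below is an illustration under stated assumptions, not code from the cited programs: the function names and the two example alignments are hypothetical, and the TM-style sum follows the unnormalized form given in the table. It shows how a score that is minimized (SAS) and one that is maximized (STRUCTAL) can prefer different sets of correspondences for the same pair of structures.

```python
import math

def rmsd(d):
    """Root mean square deviation over the aligned pairs."""
    return math.sqrt(sum(x * x for x in d) / len(d))

def sas(d):
    """SAS score: rmsd * 100 / Ne (lower is better)."""
    return rmsd(d) * 100.0 / len(d)

def structal(d, n_gaps=0):
    """STRUCTAL score: sum of 20 / (1 + d_i^2 / 5) minus 10 per gap (higher is better)."""
    return sum(20.0 / (1.0 + x * x / 5.0) for x in d) - 10.0 * n_gaps

def tm_sum(d, n_b):
    """TM-score style sum with a length-dependent distance scale (requires n_b > 15)."""
    d0 = 1.24 * (n_b - 15) ** (1.0 / 3.0) - 1.8
    return sum(1.0 / (1.0 + (x / d0) ** 2) for x in d)

def q_score(d, n_a, n_b):
    """Q-score: Ne^2 / (NA * NB * (1 + (rmsd / 3)^2))."""
    ne = len(d)
    return ne * ne / (n_a * n_b * (1.0 + (rmsd(d) / 3.0) ** 2))

# Hypothetical alternatives in the spirit of Figure 1a: a perfect
# superimposition of five points, or a looser one of all eight points.
core5 = [0.0] * 5
all8 = [1.0] * 8

for name, dev in (("five points", core5), ("eight points", all8)):
    print(name, "rmsd=%.2f SAS=%.2f STRUCTAL=%.2f" % (rmsd(dev), sas(dev), structal(dev)))
```

With these assumed deviations, SAS favors the five-point alignment while STRUCTAL favors the eight-point alignment, so the choice of score alone changes which alignment counts as optimal.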
Table 1

Quantification of structural similarity

| Type | Function (maximized unless otherwise stated) | Comments | Used in |
|---|---|---|---|
| 3D | $\mathrm{rmsd}=\sqrt{\frac{1}{N_e}\sum_{i=1}^{N_e} d_i^2}$ | Root mean square positional deviation | Rigid and flexible aligners |
| 3D | Maximize $N_e$, rmsd being a constraint | Iterative superimposition–realignment | ProSup [60], MAMMOTH [61] (final pass), CE [62] (final), LGA/GDT [56] |
| 3D | Minimize rmsd, $N_e$ being a constraint | | LOVOalign [14] |
| 3D | $\dfrac{\mathrm{rmsd}\times 100}{N_e}$ (to be minimized) | SAS score | [11] |
| 3D | $\dfrac{\mathrm{rmsd}\times 100}{N_e-N_{gaps}}$ if $N_e>N_{gaps}$; 99.9 otherwise | GSAS score | [11] |
| 3D | $\sum_{i=1}^{N_e}\dfrac{1}{1+\left(d_i/(1.24\sqrt[3]{N_B-15}-1.8)\right)^2}$ | TM-score | TM-align, Fr-TM-align [63] |
| 3D | $\dfrac{3N_e}{1+\mathrm{rmsd}}$ | S score | SARF2 [59], MatAlign [64] |
| 3D | $\sum_{i=1}^{N_e}\dfrac{20}{1+d_i^2/5}-10N_{gaps}$ | STRUCTAL score | STRUCTAL [65], LOVOalign [14] |
| 3D | $\sum_{i=1}^{N_e}e^{-(d_i/4)^2}$ | Differentiable | GASH [58], RASH [66] |
| 3D | $\dfrac{N_e^2}{N_AN_B\left(1+(\mathrm{rmsd}/3)^2\right)}$ | Q-score | SSM [40] |
| 3D | $\dfrac{\mathrm{rmsd}+\alpha}{N_e\,\beta\,\gamma+10^{-5}}$ (to be minimized) | $\alpha$ is the number of unaligned SSEs in A, $\beta$ is the contact map overlap, $\gamma$ is the relative similarity of SSE pair distances | GANGSTA [13] |
| 3D | $\sum_{\mathrm{blocks}}$ similarity of blocks $+\sum_{\mathrm{links}}$ link penalties | General form optimized by flexible aligners | CE [62] (initial), FATCAT [41], FlexProt [67], Matt [15], RAPIDO [16], PPM [6], see note |
| 3D | $\mathrm{ssap}(i,j)=\sum_{m\in A}\sum_{n\in B}\dfrac{500}{\lVert V_{i\to m}-V_{j\to n}\rVert+10}$ | Dynamic programing over the ssap matrix, where $i\in A$, $j\in B$ | SSAP [55] |
| 2D | $\sum_{i=1}^{N_e}$ similarity of residue environments$(i)$ + gap penalties | General form optimized by double dynamic programing algorithms | Vorolign [17], Vorometric [19] |
| 2D | $\mathrm{dali}=\sum_{i=1}^{N_e}\sum_{j\neq i}\left(0.2-\dfrac{D(i,j)}{\bar d_{ij}}\right)e^{-(\bar d_{ij}/20)^2}+0.2N_e$ | Elastic Dali score | Dali [43], K2 [68] |
| 2D | $\mathrm{dali}-0.2N_e$ | | MUSTANG [37] |
| 2D | $\sum_{i=1}^{N_e}\sum_{j\neq i}\left(1-\max\left\{\text{bb terms},\,10\,\dfrac{D(i,j)}{\bar d_{ij}}\right\}\right)e^{-(\bar d_{ij}/20)^2}$ | bb terms includes three comparisons of backbone directionality | FAST [69] |
| 2D | Maximal common subgraph | | TOPS [70], TOPOFIT [18] |
| 2D/1D | Log-odds scores for residue state, inter-residue distance, and SSE arrangement transitions | | Matras [8] |
| 1D | $\sum_{i=1}^{N_e}$ similarity of residue descriptors$(i)$ + gap penalties | General form of 1D alignment | YAKUSA [42], 3D-Blast [21], MAMMOTH [61] (initial) |
| Norm | $\dfrac{\mathrm{rmsd}\times\min(N_A,N_B)}{N_e}$ (to be minimized) | Similarity index (SI) | [11] |
| Norm | $1-\dfrac{1+N_e}{(1+\mathrm{rmsd}/1.5)\,(1+\min(N_A,N_B))}$ | Match index (MI) | [11] |
| Norm | $\max_{u\in \mathrm{PUU}(A),\,v\in \mathrm{PUU}(B)}\dfrac{1}{0.5}\left(\dfrac{\mathrm{dali}(u\cap v)}{m(\sqrt{N_uN_v})}-1\right)$, with $m(L)=7.95+0.71L+0.00026L^2-0.0000019L^3$ | Dali's empirical Z-score. The dali score includes only residues that belong to protein unfolding units $u$ and $v$ | Dali [43,71] |
| Norm | $0.25\,N_e\,e^{-0.39\,\mathrm{rmsd}^2}-1.7$ | Topofit's empirical Z-score. $N_e$ and rmsd of 'topomax point' | TOPOFIT [18] |
| Norm | $\dfrac{100}{\ln(50)}\,\ln\!\left(\sum_{i=1}^{N_e}\sum_{j}^{N_e}\dfrac{500}{\lVert V^A_{i\to j}-V^B_{i\to j}\rVert+10}\Big/(N_e^2-1)\right)$ | Normalized SSAP score. 80–100: highly similar; 70–80 and $N_e/\max(N_A,N_B)>60\%$: similar fold; 60–70: different fold, same class | CATH [51] |
| Norm | $100\times\dfrac{2N_e}{N_A+N_B}$ | Normalized similarity. 100–99: equivalent; 99–90: similar; 90–75: related; 75–50: remote; 50–30: distant; 30–0: unrelated | qCOPS [53] |

We use the following notation. $d_i=\lVert a_i-(Ub_i+T)\rVert$ is the C(alpha)–C(alpha) distance of the ith aligned residue pair, U is the rotation matrix and T is the translation vector of the least-squares superimposition of structure B onto the coordinate frame of structure A. $N_e$ is the number of equivalent (aligned) residues. $N_A$ and $N_B$ are the number of residues in structures A and B, respectively. $N_{gaps}$ is the number of gaps in the alignment. The $a_i$ and $b_i$ vectors hold the C(alpha) coordinates (in Angstrom units) of the ith equivalent residue in structures A and B, respectively, and $\lVert x\rVert$ is the vector norm. $D(i,j)=\bigl\lvert\,\lVert a_i-a_j\rVert-\lVert b_i-b_j\rVert\,\bigr\rvert$ is the (i,j) element of the distance difference matrix and the average of the corresponding distance matrix elements is $\bar d_{ij}=(\lVert a_i-a_j\rVert+\lVert b_i-b_j\rVert)/2$. An internal coordinate frame can be defined based on the backbone atoms of an amino acid. $V_{i\to m}$ denotes the vector from the C(beta) atom of residue i to the C(beta) atom of residue m in the coordinate frame of residue i. PUU is a decomposition of protein unfolding units. Cited structure comparison programs had a functional website.

Note: PPM's similarity score has the form

$$\mathrm{sim}(N_e,\mathrm{rmsd})=\begin{cases}0.0 & \text{if } \mathrm{rmsd}>0.5+\sqrt{N_e+1}\\[4pt] 1.0 & \text{if } \mathrm{rmsd}<0.5+\sqrt[4]{N_e+1}\\[4pt] \dfrac{-\mathrm{rmsd}+0.5+\sqrt{N_e+1}}{\sqrt{N_e+1}-\sqrt[4]{N_e+1}} & \text{otherwise}\end{cases}$$

and the cost of combining blocks is 4(1.0 − sim(N1 + N2, max(rmsd1, rmsd2))), where rmsd1 and rmsd2 are the rmsd resulting from superimposing the other of a pair of aligned blocks.

rmsd optimization [14], multiple structure alignment [15], fragment assembly [6,31], and the alignment of interfaces in protein complexes [32]. Fragment assembly algorithms are combinatorial and can generate nonsequential alignments [13,33–35]. Consistency scoring [36] has been used in the generation of multiple alignments from libraries of pairwise alignments [37,38].

Database retrieval and classification
Nowadays, newly solved structures are routinely scanned against structures deposited in the Protein Data Bank. The goal is to retrieve all structures with a significant similarity to the query structure. The related classification problem requires a threshold for homology or fold similarity [3]. Normalized scores (labeled 'Norm' in Table 1) take the size of the proteins into account so that proteins can be compared all versus all on the same scale. Wrabl and Grishin [39] propose a universal P-value as a function of molecular shape parameters, rmsd and Ne based on theoretical considerations of random structures; unexpectedly, this P-value shows a fair correlation with the empirical Z-score of a totally unrelated method, Dali.

A practical challenge faced by protein structure comparison programs is that they either have to run ever faster, or evolve new strategies, to cope with the ever-increasing amount of known structures. Very fast programs can afford to perform one against all comparisons [7,8,13,18,40–42]. For slower methods, the investment in precomputing structural relationships within the PDB pays off, as many dissimilar structures can be excluded based on triangulated distances without explicit alignment [19,43,44]. Dali's similarity score is not a metric and it uses rules of thumb to focus the search to the neighborhoods of previously found strong matches [43]. It can take advantage of fast filters to connect the query to the network of similarities, and then collects the whole neighborhood by walking deeper into the graph.

Structural classifications also seem to lag increasingly behind. The Dali Database of all versus all alignments is currently updated on a twice-yearly schedule [43]. At the time of writing, the latest CATH release dates to July 2008. SCOP's last stable release (v.1.73) is from 2007; a current beta version is available.

Evaluation
Common evaluation tests measure first, the accuracy of the alignments; second, the ability of the alignment score to discriminate homologous from unrelated proteins in database-wide comparisons; and, third, the 'quality' of alignments. The first two depend on a manually curated reference alignment (e.g. HOMSTRAD [45], CDD [46,47], Sisyphus [48], BIRC [49]), or classification (e.g. SCOP [50] and CATH [51]) as the standard of truth. In alignment quality assessment, the alignments generated by the evaluated method are rescored for 'quality' with an

(rmsd, Ne)-based scoring function, and this is called reference-independent evaluation [11,52].

Reference-independent evaluation has been proposed to alleviate some problems of explicit reference alignments. For example, structures with repeats have translational symmetry and thus many reasonable alternative alignments, though only one may occur in the reference. Database-wide discrimination tests are based on ROC curves with SCOP as the most popular reference. Many researchers have remarked on inconsistencies in SCOP with respect to quantitative criteria [3]. As a remedy, qCOPS [53] divides structurally divergent SCOP families into subfamilies with a large common core. Another test set consists of structures classified consistently between SCOP and CATH [6].

The problem with reference-independent evaluation is that it is a test of the similarity between scoring functions [47]. rmsd can be computed from any alignment, and consequently alignments generated by any method can be positioned in an (rmsd, Ne) plane. Shifting the goalposts between alignment optimization and evaluation (by switching the scoring function) leads to unstable behavior. rmsd-based criteria are sensitive to the inclusion or exclusion of outliers, that is, the boundaries of the common core, and flexible and elastic scoring functions are not monotonic with respect to rigid rmsd-based functions. We dwell on this point here because the paradox that some programs produce 'bad' alignments (in terms of rmsd), even when they perform well in classification tests [11], is often quoted out of context. In our opinion, the quality of an alignment is only defined in terms of the native score of the method used to generate (optimize) the alignment.

We made a diagnostic test of the accuracy of pairwise structural alignments using urease and adenosine deaminase as a test case (Figure 1b). Here we are not interested in how many residues are included in the common core. We merely ask whether the active site is recognized as being structurally equivalent. The active site brings together four conserved sequence motifs, which can be found using techniques [36]. Therefore we consider this a biologically significant motif composed of homologous residues [54]. Out of 32 programs or web servers tested, six programs (SSAP [55], LGA/GDT [56,57], TOPOFIT [18], GASH [58], PPM [6], and DaliLite [43]) aligned the functional residues. Ten methods (including Matt [15], Matras [8], and SARF2 [59]) aligned between one and three of the four motifs. Half of the programs aligned none of the motifs. No particular type of score accounted for success or the lack of it. In other words, it seems that the alignment of the motifs corresponds to a genuine optimum of structural similarity, which is present in scoring functions based on different principles and structural representations: alignments can agree on essentials, even when they get unequal 'quality' scores. Not all programs using the same type of score generated similar alignments. Therefore developers should pay special attention to the robustness of optimization protocols.

Conclusions
The goal of structural alignment can be defined geometrically, as the superimposition of a maximal number of points with minimal spatial deviations, or genetically, as the identification of homologous residues. Mutations in protein sequences lead to plastic deformations of the three-dimensional structure. Therefore structure comparison methods that allow for flexibility and plasticity generate the most biologically meaningful alignments. Accurate pairwise and multiple alignments are needed for the mapping, characterization, and classification of functional sites, with uses in biomedical studies. Evolutionary models of the 'morphing' of structures in fold space are emerging and may eventually replace ad hoc scores for structural similarity. Such models will advance the study of the interplay of sequence and structure evolution in the future, with practical implications for model building by homology and structural .

Funding
Academy of Finland projects #109849 and #114498.

References and recommended reading
Papers of particular interest, published within the period of review, have been highlighted as:
• of special interest
•• of outstanding interest

1. Richardson JS: Anatomy and taxonomy of protein structures. Adv Protein Chem 1981, 34:167-339.
2. Pascual-García A, Abia D, Ortiz AR, Bastolla U: Cross-over between discrete and continuous protein structure space: insights into automatic classification and networks of protein structures. PLoS Comput Biol 2009, 5(3):e1000331.
The authors discover a transition in fold space between two regimes with consistent and ambiguous residue-to-residue correspondences, respectively. The consistent regime corresponds to superfamilies, which form discrete clusters in fold space. A network model is more appropriate to describe similarities between folds in the other regime.
3. Sam V, Tai CH, Garnier J, Gibrat JF, Lee B, Munson PJ: ROC and confusion analysis of structure comparison methods identify the main causes of divergence from manual protein classification. BMC Bioinform 2006, 7: Art. No. 206.
4. Lesk AM, Chothia C: How different amino-acid-sequences determine similar protein structures – structure and evolutionary dynamics of the globins. J Mol Biol 1980, 136:225.
5. Löytynoja A, Goldman N: Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science 2008, 320:1632-1635.
6. Csaba G, Birzele F, Zimmer R: Protein structure alignment considering phenotypic plasticity. Bioinformatics 2008, 24:I98-I104.
PPM is a flexible aligner with good performance in benchmark tests. PPM is a fragment assembly program, which first identifies un-gapped blocks (pairs of fragments) with low rmsd. A graph is constructed to represent block compatibility. The score of the alignment is given by the quality of the blocks (vertex weights) and their linkage (edge weights). Blocks are linked to each other in a tree (an acyclic subgraph), which allows to obtain


the provably best linkage path using an algorithm called A-star. The level of Very fast database retrieval of structures with highly similar number and flexibility is controlled by a parameter k such that the kth-best outgoing link arrangement of SSEs. The tableau is a matrix of angles between SSEs. from each block is evaluated. Website: http://www.bio.ifi.lmu.de/PPM. The angles are discretized in bins of 458. The type (alpha or beta) of SSE is written on the diagonal of the matrix. The whole matrix can now be 7. Gibrat J-F, Madej T, Bryant SH: Surprising similarities in expressed as a string. Structures, which have an equal number of SSEs structure comparison. Curr Opin Struct Biol 1996, 6:377-385. and similar fold, generate identical tableaux. Approximate matching with a continuous similarity score is not amenable to the fast index search; a 8. Kawabata T, Nishikawa K: Protein tertiary structure comparison slower double dynamic programing procedure is described. using the Markov transition model of evolution. Proteins 2000, 41:108-122. 26. Sael L, Li B, La D, Fang Y, Ramani K, Rustamov R, Kihara D: Fast protein structure tertiary structure retrieval based on global CLEMAPS: multiple alignment of 9. Liu X, Zhao YP, Zheng WM: surface shape similarity. Proteins: Struct Funct Bioinform 2008, protein structures based on conformational letters. Proteins: 72:1259-1273. Struct Funct Bioinform 2008, 71:728-736. 27. Teichert F, Bastolla U, Porto M: SABERTOOTH: protein Assessment of the probabilities for 10. Viksna J, Gilbert D: structural alignment based on a vectorial structure evolutionary structural changes in protein folds . Bioinformatics representation. BMC Bioinform 2007, 8:425. 2007, 23:832-841. TOPS topology diagrams are compared to model the probabilities of the 28. Zotenko E, Dogan RI, Wilbur WJ et al.: Structural footprinting in creation, deletion, and mutation of SSEs. protein structure comparison: the impact of structural fragments. 
BMC Struct Biol 2007, 7:53. 11. Kolodny R, Koehl P, Levitt M: Comprehensive evaluation of protein structure alignment methods: scoring by geometric 29. Kolodny R, Linial N: Approximate protein structural measures. J Mol Biol 2005, 346:1173-1188. alignment in polynomial time. Proc Natl Acad Sci U S A 2004, 101:12201-12206. 12. Sierk ML, Pearson WR: Sensitivity and selectivity in protein structure comparison. Protein Sci 2004, 13:773-785. 30. Taylor WR, Orengo CA: Protein structure alignment. J Mol Biol 1989, 208:1-22. 13. Guerler A, Knapp EW: Novel protein folds and their nonsequential structural analogs. Protein Sci 2008, 31. Pelta DA, Gonzalez JR, Vega MM: A simple and fast heuristic for 17:1374-1382. protein structure comparison. BMC Bioinformatics 2008, 9:161. 14. Martinez L, Andreani R, Martinez JM: Convergent algorithms 32. Pulim V, Berger B, Bienkowska J: Optimal contact map for protein structural alignment. BMC Bioinform 2007, 8: Art. alignment of protein–protein interfaces. Bioinformatics 2008, No. 306. 24:2324-2328. 15. Menke M, Berger B, Cowen L: Matt: local flexibility aids protein 33. Abyzov A, Ilyin VA: A comprehensive analysis of non-sequential multiple structure alignment. PLoS Comput Biol 2008, 4: Art. No. alignments between all protein structures. BMC Struct Biol e10. 2007, 7:78. This multiple structure alignment program achieves good superimposi- tions with low rmsd and high Ne. The fragment assembly phase is highly 34. Bhattacharya S, Bhattacharyya C, Chandra NR: Comparison of flexible while the final alignment is refined with respect to rigid-body protein structures by growing neighborhood alignments. BMC superimposition. Website: http://groups.csail.mit.edu/cb/matt/. Bioinform 2007, 8:77. 16. Mosca R, Schneider TR: RAPIDO: a web server for the 35. Dundas J, Binkowski TA, DasGupta B et al.: Topology alignment of protein structures in the presence of independent protein structural alignment. BMC Bioinform conformational changes. 
Nucleic Acids Res 2008, 36:42-46. 2007, 8:388. 17. Birzele F, Gewehr JE, Csaba G, Zimmer R: Vorolign—fast 36. Heger A, Mallick S, Wilton C, Holm L: The global trace graph, a structural alignment using Voronoi contacts. Bioinformatics novel paradigm for searching protein sequence databases. 2007, 23:e205-e211. Bioinformatics 2007, 23:2361-2367. 18. Leslin CM, Abyzov A, Ilyin VA: TOPOFIT-DB, a database of 37. Konagurthu AS, Whisstock JC, Stuckey PJ, Lesk AM: MUSTANG: protein structural alignments based on the TOPOFIT method. a multiple structural alignment algorithm. Proteins 2006, Nucleic Acids Res 2007, 35:D317-D321. 64:559-574 # fails Diagnostic!. Topological invariants provide an objective way of defining the extent of the common core using the ‘topomax point’. Website: http://mozart.bio.- 38. Pei JM, Tang M, Grishin NV: PROMALS3D web server for neu.edu/topofit/index.php. accurate multiple protein sequence and structure alignments 2008. 19. Sacan A, Toroslu IH, Ferhatosmanoglu H: Integrated search and alignment of protein structures. Bioinformatics 2008, 39. Wrabl JO, Grishin NV: Statistics of random protein 24:2872-2879. superpositions: p-values for pairwise structure alignment. J Comput Biol 2008, 15:317-355. 20. Friedberg I, Harder T, Kolodny R et al.: Using an alignment of fragment strings for comparing protein structures. 40. Krissinel E, Henrick K: Secondary-structure matching (SSM), a Bioinformatics 2007, 23:E219-E224. new tool for fast protein structure alignment in three dimensions. Acta Crystsllogr 2004, D60:2256-2268. 21. Tung CH, Huang JW, Yang JM: Kappa–alpha plot derived structural alphabet and BLOSUM-like substitution matrix for 41. Veeramalai M, Ye YZ, Godzik A: TOPS++FATCAT: fast flexinle rapid search of protein structure database. Genome Biol 2007, structural alignment using constraints derived from TOPS+ 8:R31. string model. BMC Bioinform 2008, 9: Art. No. 358. 22. Wang S, Zheng WM: CLePAPS: fast pair alignment of protein 42. 
Carpentier M, Brouillet S, Pothier J: YAKUSA: a fast structural structures based on conformational letters. J Bioinform database scanning method. Proteins 2005, 61:137-151. Comput Biol 2008, 6(April (2)):347-366. 43. Holm L, Ka¨ a¨ ria¨ inen S, Rosenstro¨ m P, Schenkel A: Searching 23. Yang JA: Comprehensive description of protein structures protein structure databases with DaliLite v.3. Bioinformatics using shape code. Proteins: Struct Funct 2008, 24:2780-2781. Bioinform 2008, 71:1497-1518. Dali is the oldest running protein structure database search server. Data- base searches are performed by ‘walking’ in fold space. As a representation 24. Chu CH, Tang CY, Tang CY et al.: Angle-distance image of fold space, all versus all comparisons were replaced by a sparser matching techniques for protein structure comparison. J Mol network of similarities, enabling weekly PDB updates in sublinear time Recognit 2008, 21:442-452. relative to database size. Website: http://ekihdna.biocenter.helsinki.fi/dali/ start. 25. Konagurthu Arun S, Stuckey Peter J, Lesk Arthur M: Structural search and retrieval using a tableau representation of protein 44. Sippl M: On distance and similarity in fold space. Bioinformatics folding patterns. Bioinformatics 2008, 24:645-651. 2008, 24:872-873. www.sciencedirect.com Current Opinion in Structural Biology 2009, 19:341–348 348 Sequences and Topology

Distance metrics have useful properties in the organization and searching of fold space. Ne, by any local alignment method, can be converted into a metric distance. Website: http://services.came.sbg.ac.at.
45. Stebbings LA, Mizuguchi K: HOMSTRAD: recent developments of the Homologous Protein Structure Alignment Database. Nucleic Acids Res 2004:D203-D207 doi: 10.1093/nar/gkh027.
46. Marchler-Bauer A, Anderson JB, Derbyshire MK, DeWeese-Scott C, Gonzales NR, Gwadz M, Hao L, He S, Hurwitz DI, Jackson JD et al.: CDD: a conserved domain database for interactive domain family analysis. Nucleic Acids Res 2007, 35(Database issue):D237-D240.
47. Kim C, Lee B: Accuracy of structure-based sequence alignment of automatic methods. BMC Bioinform 2007, 8:355.
48. Andreeva A, Prlic A, Hubbard TJP, Murzin A: SISYPHUS: structural alignments for proteins with non-trivial relationships. Nucleic Acids Res 2007, 35:D253-259.
49. Mayr G, Domingues FS, Lackner P: Comparative analysis of protein structure alignments. BMC Struct Biol 2007, 7:50.
50. Andreeva A, Howorth D, Chandonia J-M, Brenner SE, Hubbard TJP, Chothia C, Murzin AG: Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res 2008, 36:D419-D425.
51. Cuff AL, Sillitoe I, Lewis T, Redfern OC, Garratt R, Thornton J, Orengo CA: The CATH classification revisited: architectures reviewed and new ways to characterize structural divergence in superfamilies. Nucleic Acids Res 2008, 36:D310-D314.
52. Qi Y, Sadreyev RI, Wang Y, Kim B-H, Grishin NV: A comprehensive system for evaluation of remote sequence similarity detection. BMC Bioinform 2007, 8:314.
53. Suhrer J, Wiederstein M, Sippl M: QSCOP–SCOP quantified by structural relationships. Bioinformatics 2007, 23:513-514.
54. Holm L, Sander C: An evolutionary treasure: unification of a broad set of amidohydrolases related to urease. Proteins 1997, 28:72-82.
55. Taylor WR, Orengo CA: A local alignment method for protein structure motifs. J Mol Biol 1993, 233:488-497.
56. Zemla A: LGA: a method for finding 3D similarities in protein structures. Nucleic Acids Res 2003, 31(13):3370-3374.
57. Zemla A, Geisbrecht B, Smith J, Lam M, Kirkpatrick B, Wagner M, Slezak T, Zhou CE: STRALCP structure alignment-based clustering of proteins. Nucleic Acids Res 2007, 35(22):e150 doi: 10.1093/nar/gkm1049.
58. Standley DM, Toh H, Nakamura H: GASH: an improved algorithm for maximizing the number of equivalent residues between two protein structures. BMC Bioinform 2005, 6:221.
59. Alexandrov NN, Fischer D: Analysis of topological and nontopological structural similarities in the PDB: new examples with old structures. Proteins 1996, 25:354-365.
60. Lackner P, Koppensteiner WA, Sippl MJ, Domingues FS: ProSup: a refined tool for protein structure alignment. Protein Eng 2000, 13:745-752.
61. Olmea O, Straus CE, Ortiz AR: MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci 2002, 11:2606-2621.
62. Shindyalov IN, Bourne PE: Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng 1998, 11(9):739-747.
63. Pandit SB, Skolnick J: Fr-TM-align: a new protein structural alignment method based on fragment alignments and the TM-score. BMC Bioinform 2008, 9: Art. No. 531.
64. Aung Z, Tan KL: MatAlign: precise protein structure comparison by matrix alignment. J Bioinform Comput Biol 2006, 4:1197-1216.
65. Subbiah S, Laurents DV, Levitt M: Structural similarity of DNA-binding domains of bacteriophage repressors and the globin core. Curr Biol 1993, 3:141-148.
66. Standley DM, Toh H, Nakamura H: Detecting local structural similarity in proteins by maximizing number of equivalent residues. Proteins 2004, 57:381-391.
67. Shatsky M, Wolfson HJ, Nussinov R: Flexible protein alignment and hinge detection. Proteins: Struct, Funct, Genet 2002, 48:242-256.
68. Szustakowski JD, Weng Z: Protein structure alignment using evolutionary computing. In Evolutionary Computation in Bioinformatics. Edited by Fogel G, Corne D. Morgan Kaufmann; 2002.
69. Zhu J, Weng Z: FAST: a novel protein structure alignment algorithm. Proteins: Struct Funct Bioinform 2005, 14:417-423.
70. Veeramalai M, Gilbert D: A novel method for comparing topological models of protein structures enhanced with ligand information. Bioinformatics 2008, 24:2698-2705.
71. Holm L, Sander C: Dictionary of recurrent domains in protein structures. Proteins 1998, 33:88-96.

Current Opinion in Structural Biology 2009, 19:341–348 www.sciencedirect.com