BMC Bioinformatics Biomed Central

BMC Bioinformatics BioMed Central Research article Open Access State of the art: refinement of multiple sequence alignments Saikat Chakrabarti*, Christopher J Lanczycki, Anna R Panchenko, Teresa M Przytycka, Paul A Thiessen and Stephen H Bryant Address: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA Email: Saikat Chakrabarti* - [email protected]; Christopher J Lanczycki - [email protected]; Anna R Panchenko - [email protected]; Teresa M Przytycka - [email protected]; Paul A Thiessen - [email protected]; Stephen H Bryant - [email protected] * Corresponding author Published: 14 November 2006 Received: 31 July 2006 Accepted: 14 November 2006 BMC Bioinformatics 2006, 7:499 doi:10.1186/1471-2105-7-499 This article is available from: http://www.biomedcentral.com/1471-2105/7/499 © 2006 Chakrabarti et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Abstract Background: Accurate multiple sequence alignments of proteins are very important in computational biology today. Despite the numerous efforts made in this field, all alignment strategies have certain shortcomings resulting in alignments that are not always correct. Refinement of existing alignment can prove to be an intelligent choice considering the increasing importance of high quality alignments in large scale high-throughput analysis. Results: We provide an extensive comparison of the performance of the alignment refinement algorithms. The accuracy and efficiency of the refinement programs are compared using the 3D structure-based alignments in the BAliBASE benchmark database as well as manually curated high quality alignments from Conserved Domain Database (CDD). Conclusion: Comparison of performance for refined alignments revealed that despite the absence of dramatic improvements, our refinement method, REFINER, which uses conserved regions as constraints performs better in improving the alignments generated by different alignment algorithms. In most cases REFINER produces a higher-scoring, modestly improved alignment that does not deteriorate the well-conserved regions of the original alignment. Background ing a global alignment over the entire length of the The reliability and accuracy of many bioinformatics meth- sequences; they differ mainly in the procedure employed ods such as homolog identification, comparative mode- to determine the order of the sequences to be aligned. Iter- ling, phylogenetic analysis and others depend heavily on ative algorithms [4,5] generally attempt to improve the the quality of multiple sequence alignments. Heuristic overall quality of alignment by employing an objective approaches such as progressive and iterative methods are function and heuristic measures to obtain an optimal generally used to obtain multiple sequence alignments in alignment. Alternative approaches that utilize a co-opera- a computationally efficient manner. In progressive tive strategy to integrate complementary algorithms [6,7] approaches, a multiple alignment is generally built up and/or incorporate additional biological data [8,9] have gradually by aligning the most similar sequences first and also been developed. successively adding in more distant relatives. A number of alignment programs apply this strategy [1-3] by construct- Page 1 of 10 (page number not for citation purposes) BMC Bioinformatics 2006, 7:499 http://www.biomedcentral.com/1471-2105/7/499 Despite the numerous efforts made in this field, each of junction with all alignment-generation algorithms tested. these strategies has weaknesses that can result in align- Further, of the three methods studied REFINER best ments that do not reflect the correct evolutionary relation- avoids degrading the original alignment's quality. ships. This persistent difficulty of course reflects the fact that aligning multiple sequences is a highly non-trivial Results and discussion task (in both a biological and computational sense) Improvement of alignment whose accuracy in practice depends largely on the choice Alignments generated by ClustalW version 1.83 [18], of input sequences, the objective function and the heuris- Muscle version 3.52 [19], Dialign version 2.3 [20], FFTNSI tics employed. Therefore, the application of an alignment from the Mafft package version 5.743 [5,21], ProbCons refinement algorithm to an existing or automatically-gen- version 1.09 [22] and TCoffee version 3.93 [7], were erated alignment can be helpful for detecting alignment refined by three different methods: REFINER [15], RF problems. Alignment refinement as a post-processing method [13] and RASCAL [14]. Since each of these refine- operation is particularly worthwhile considering the ment methods seeks a multiple alignment with the high- increasing importance of high quality alignments in large est score, we first compared their optimization procedures scale high-throughput analysis. by calculating scores from the refined alignments using four different objective scores. Figure 1 shows the relative Alignment refinement has mainly relied on iterative improvement of refinement as measured by alignment approaches [4,8,10-12]. Recently Wallace et al., [13] stud- score, conservation score [SCORECONS score, [23]], ied various iterative schemes and showed that perform- norMD score [24] and information content for the BAli- ance of alignment algorithms can be improved by BASE benchmark alignments. The percentage of refined including iteration steps during the progressive align- alignments that exhibit an improved score, i.e. where the ment. Another refinement program, RASCAL, imple- refined alignment has a higher objective score than the mented by Thompson et al., [14] uses a knowledge-based original alignments, is found to be highest for REFINER strategy to improve alignments where alignment is for all objective scores tested. When using alignment decomposed into reliable and unreliable regions and only score, SCORECONS score, norMD score and information unreliable alignment regions are modified. content, respectively, these percentages are 94%, 98%, 90% and 84% for REFINER, 92%, 90%, 86% and 63% for Recently we reported a new algorithm, REFINER [15], that the RF method, and 94%, 94%, 87% and 57% for RAS- refines a multiple sequence alignment by iterative realign- CAL. Notably, these values also reveal that REFINER pro- ment of its individual sequences, using the predetermined duces fewer cases in which the objective score of the conserved core (block) model as a constraint. Realign- refined alignments drops. ment of each sequence can correct misalignments between a given sequence and the rest of the profile by We observe similar results (Figure 2) when REFINER, RF shifting the individual aligned blocks on that sequence yet and RASCAL are applied to alignments from the CDD at the same time preserves the family's overall block alignment dataset. In this case, refinement algorithms model (i.e., the sequence and structurally conserved were applied to the original CDD alignments and objec- regions). The constraint prohibits the insertion of gap tive scores are computed pre- and post-refinement. characters in the middle of conserved blocks. Although the extent of positive improvement (REFINER: 45%, 42%, 34% and 68%; RF: 35%, 31%, 20% and 51% In this study we compare the performance of three pub- and RASCAL: 41%, 25%, 11% and 32%) is much lower in lished alignment refinement algorithms. The accuracy and this dataset, relative to the other methods REFINER per- efficiency of RASCAL [14], the Remove First (RF) method forms consistently well for different scoring functions and from Wallace et al., [13] and REFINER [15] methods were also results in fewer cases where alignment accuracy actu- compared using the 3D structure-based alignments from ally degrades following refinement. the BAliBASE benchmark database [16] and a collection of manually curated high quality alignments from Con- While REFINER performs well numerically (in terms of served Domain Database (CDD) [17]. The quality of the the objective score), the biological relevance of the refined refined alignments was assessed in terms of various scor- alignments should also be considered. In this context the ing functions, by consistency with structure-structure BAliBASE sum-of-pairs (SP) scoring scheme [16] is alignments from BAliBASE, and measuring sensitivity in adopted to evaluate the quality of derived alignments. The profile-based database searches. As a practical matter, we improvements in SP score exhibited by the REFINER, RF also report the CPU time required by the three methods. and RASCAL refinement methods are illustrated in Table This comparison study reveals that while none of the 1. The 'Default' columns in Table 1 provide the average SP refinement methods provide dramatic improvements, the score for alignments generated by the individual align- REFINER algorithm performs consistently well in con- ment programs used in this benchmarking study, whereas Page 2 of 10 (page number not for citation purposes) BMC Bioinformatics 2006, 7:499 http://www.biomedcentral.com/1471-2105/7/499 ImprovementFigure 1 of objective scores after refinement (BAliBASE dataset) Improvement of objective scores after refinement (BAliBASE

BMC Bioinformatics Biomed Central

T-Coffee Documentation Release Version 13.45.47.Aba98c5

Multiple Sequence Alignment: a Major Challenge to Large-Scale Phylogenetics

Ple Sequence Alignment Methods: Evidence from Data

Performance Evaluation of Leading Protein Multiple Sequence Alignment Methods

Multiple Sequence Alignment Based on Profile Alignment of Intermediate

Multiple Sequence Alignments Iain M Wallace, Gordon Blackshields and Desmond G Higgins

Multiple Sequence Alignment II Outline

Grammar-Based Distance in Progressive Multiple Sequence Alignment

Multiple Sequence Alignment Using Probcons and Probalign

Enhanced PROBCONS for Multiple Sequence Alignment in Cloud Computing

Improvement in Accuracy of Multiple Sequence Alignment Using Novel

Benchmarking Statistical Multiple Sequence Alignment