Sequence Alignment What Is a Multiple Alignment?


What about comparing a group of sequences (instead of a group of structures)? Multiple Sequence Alignment

What is a multiple alignment?

A representation of a set of sequences in which equivalent residues (e.g. functional, structural) are aligned in rows or, more usually, columns.

Example: part of an alignment of SH2 domains from 14 sequences (lnk_rat, crk1_mouse, nck_human, ht16_hydat, pip5_human, fer_human, 1ab2, 1mil, 1blj, 1shd, 1lkkA, 1csy, 1bfi, 1gri)

* conserved identical residues
: conserved similar residues

[Figure: the SH2 alignment annotated with conserved residues, conservation profile and secondary structure]

Multiple sequence alignments

How to score an alignment of many sequences?

Given M sequences A_i, we can define a score for the multiple sequence alignment as the sum of the scores of all the induced pairwise alignments:

$$S = \sum_{i<j} S(A_i, A_j)$$

Example:

1>ASPTLPLSLA
2>SS-TLPA--A
3>SSPTLPA--A

S(1,2,3) = S(1,2) + S(1,3) + S(2,3)

Multiple sequence alignments

Can you think of other measures instead of the sum of pairwise scores?

• Try to MAXIMIZE the information content of all the columns (see sequence logos):

$$I = \log_2 20 - S(p), \qquad S(p) = -\sum_{i=1}^{20} p_i \ln p_i$$

where p_i is the frequency of residue i in the column (for I to be expressed in bits, the logarithm in S(p) should also be taken in base 2).

• MINIMIZE the overall entropy:

$$S = -\sum_{j=1}^{N_{\mathrm{columns}}} \sum_{i=1}^{20} p_{ji} \ln p_{ji}$$

Multiple sequence alignment (MSA)

The algorithmic problem is to find the alignment with the maximum score.

• Exact algorithms: algorithms based on multi-dimensional dynamic programming have been implemented. However, they are too slow when many sequences have to be compared.
• Progressive alignments
• Iterative algorithms
• Consistency-based algorithms

Multiple Alignment Construction

Optimal multiple alignment
Example: MSA (Lipman et al. 1989, Gupta et al. 1995)

Optimal multiple alignment

• Extension of the dynamic programming used for 2 sequences to k sequences => k dimensions
• Example: alignment of 3 sequences
• Problem: calculation time and memory requirements
• Time proportional to N^k for k sequences of length N => limited to fewer than 10 sequences

Multiple Alignment Construction

• Optimal multiple alignment: MSA, OMA
• Progressive multiple alignment: ClustalW (Thompson et al. NAR 1994), ClustalX (Thompson et al. NAR 1997)

Progressive multiple alignment

Idea: progressively align pairs of sequences (or groups of sequences).

Problem: which sequences to start with? How to decide the order of alignment?
• First align the most closely related sequences.

How to measure the similarity of the sequences?
• Align all the sequences pairwise.
• Calculate the similarity between each pair from the alignment.

Progressive multiple alignments (sequences A, B, C, D)

Step 1: Pairwise sequence alignment: exact, all-against-all

Pair   S     id
A-B    70    50%
A-D    60    40%
B-D    95    80%
A-C    100   90%
B-C    70    50%
C-D    80    70%

Step 2: Build a similarity tree: ((A, C), (B, D))

Progressive multiple alignments

Step 3: Exact alignment of the most similar sequences, following the tree
Step 4: Build the profile from the sub-alignments
Step 5: Perform profile-to-profile alignment following the similarity tree, until all the sequences are included

Profile-to-profile score

Position i along the first sequence profile is represented by a 20-valued vector

$$P^1_i = \left( P^1_i(A),\; P^1_i(C),\; \ldots,\; P^1_i(Y) \right)$$

Position j along the second sequence profile is represented by a 20-valued vector

$$P^2_j = \left( P^2_j(A),\; P^2_j(C),\; \ldots,\; P^2_j(Y) \right)$$

The score for aligning the two positions is

$$\mathrm{Score}(i,j) = \sum_{m=1}^{20} \sum_{k=1}^{20} P^1_i(r_m)\, P^2_j(r_k)\, M(r_m, r_k)$$

where M is a score matrix (BLOSUM or PAM). The score can be used in dynamic programming procedures (Needleman-Wunsch, Smith-Waterman).

"Once a gap, always a gap"

• Where gaps are added is a critical question
• Gaps are often added to the first two (closest) sequences
• To change the initial gap choices later on would be to give more weight to distantly related sequences
• To maintain the initial gap choices is to trust that those gaps are most believable

Problems with progressive algorithms

• Dependence of the ultimate MSA on the initial pairwise sequence alignment with the highest score
• Errors in initial alignments are propagated
• Gaps can proliferate, if not careful
• Gap penalties can be made amino-acid specific, so that the introduction of gaps is penalized more heavily in segments that are less likely to contain gaps (e.g. the hydrophobic core)

CLUSTALW

Multiple sequence alignment of cytochromes

• Download from UniProtKB the sequences of the following proteins (in FASTA format):
  • P99999 (human)
  • P00004 (horse)
  • P0C0X8 (Rhodobacter)
  • P00091 (Rhodopseudomonas)
  • Q93VA3 (Arabidopsis)
• Align them with ClustalW @ http://clustalw.ddbj.nig.ac.jp/

Result summary

CLUSTAL 2.1 Multiple Sequence Alignments
Sequence type explicitly set to Protein
Sequence format is Pearson
Sequence 1: sp|P99999|CYC_HUMAN_Cytochrome_c_OS=Homo_sapiens_GN=CYCS_PE=1_SV=2 105 aa
Sequence 2: sp|P00004|CYC_HORSE_Cytochrome_c_OS=Equus_caballus_GN=CYCS_PE=1_SV=2 105 aa
Sequence 3: sp|P0C0X8|CYC2_RHOSH_Cytochrome_c2_OS=Rhodobacter_sphaeroides_GN=cycA_PE=1_SV=1 124 aa
Sequence 4: sp|P00091|CYC22_RHOPA_Cytochrome_c2_OS=Rhodopseudomonas_palustris 139 aa
Sequence 5: sp|Q93VA3|CYC6_ARATH_Cytochrome_c6__chloroplastic_OS=Arabidopsis_thaliana_ 175 aa
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 88
Sequences (1:3) Aligned. Score: 30
Sequences (1:4) Aligned. Score: 27
Sequences (1:5) Aligned. Score: 5
Sequences (2:3) Aligned. Score: 32
Sequences (2:4) Aligned. Score: 28
Sequences (2:5) Aligned. Score: 5
Sequences (3:4) Aligned. Score: 36
Sequences (3:5) Aligned. Score: 3
Sequences (4:5) Aligned. Score: 3
Guide tree file created: [../work/2016/03/22/0612/clustalw_1458594734_746144_63242.dnd]
There are 4 groups
Start of Multiple Alignment
Aligning...
Group 1: Sequences: 2 Score:2188
Group 2: Sequences: 2 Score:1660
Group 3: Sequences: 4 Score:1071
Group 4: Delayed
Alignment Score 1148
CLUSTAL-Alignment file created [../work/2016/03/22/0612/clustalw_1458594734_746144_63242.aln]

CLUSTAL 2.1 Multiple Sequence Alignments
Sequence format is CLUSTAL
Sequence 1: sp|P99999|CYC_HUMAN_Cytochrome 177 aa
Sequence 2: sp|P00004|CYC_HORSE_Cytochrome 177 aa
Sequence 3: sp|P0C0X8|CYC2_RHOSH_Cytochrom 177 aa
Sequence 4: sp|P00091|CYC22_RHOPA_Cytochro 177 aa
Sequence 5: sp|Q93VA3|CYC6_ARATH_Cytochrom 177 aa
Bootstrap output file created: [../work/2016/03/22/0612/clustalw_1458594734_746144_63242.phb]

Guide tree

(
(
sp|P99999|CYC_HUMAN:0.06190,
sp|P00004|CYC_HORSE:0.05238)
:0.28253,
(
sp|P0C0X8|CYC2_RHOSH:0.31114,
sp|P00091|CYC22_RHOPA:0.32595)
:0.04416,
sp|Q93VA3|CYC6_ARATH_:0.60318);

The tree is visualised with http://etetoolkit.org/treeview/

Visualizing trees with etetoolkit

CLUSTAL 2.1 multiple sequence alignment

sp|P99999|CYC_HUMAN   ------------------------MGDVEKGKKIFIMKCS------QCHT 20
sp|P00004|CYC_HORSE   ------------------------MGDVEKGKKIFVQKCA------QCHT 20
sp|P0C0X8|CYC2_RHOSH  -----------------------QEGDPEAGAKAFNQCQTCHVIVDDSGT 27
sp|P00091|CYC22_RHOPA ----------MVKKLLTILSIAATAGSLSIGTASAQDAKAGEAVFKQCMT 40
sp|Q93VA3|CYC6_ARATH_ MRLVLSGASSFTSNLFCSSQQVNGRGKELKNPISLNHNKDLDFLLKKLAP 50
                                               *.

sp|P99999|CYC_HUMAN   VEKGGKHKTGPNLHG--LFGRKTGQAPGYS-YTAANKN---KGIIWGEDT 64
sp|P00004|CYC_HORSE   VEKGGKHKTGPNLHG--LFGRKTGQAPGFT-YTDANKN---KGITWKEET 64
sp|P0C0X8|CYC2_RHOSH  TIAGRNAKTGPNLYG--VVGRTAGTQADFKGYGEGMKEAGAKGLAWDEEH 75
sp|P00091|CYC22_RHOPA CHRADKNMVGPALGG--VVGRKAGTAAGFT-YSPLNHNSGEAGLVWTADN 87
sp|Q93VA3|CYC6_ARATH_ PLTAVLLAVSPICFPPESLGQTLDIQRGATLFNRACIGCHDTGGNIIQPG 100
                         .    ..*       .*:.         :          *

sp|P99999|CYC_HUMAN   LMEYLENP------------------KKYIPG----------TKMIFVGI 86
sp|P00004|CYC_HORSE   LMEYLENP------------------KKYIPG----------TKMIFAGI 86
sp|P0C0X8|CYC2_RHOSH  FVQYVQDPTK--------------FLKEYTGD------AKAKGKMTFK-L 104
sp|P00091|CYC22_RHOPA IINYLNDPNA--------------FLKKFLTDKGKADQAVGVTKMTFK-L 122
sp|Q93VA3|CYC6_ARATH_ ATLFTKDLERNGVDTEEEIYRVTYFGKGRMPGFG--EKCTPRGQCTFGPR 148
                         : ::                   *    .           :  *

sp|P99999|CYC_HUMAN   KKKEERADLIAYLKKATNE-------- 105
sp|P00004|CYC_HORSE   KKKTEREDLIAYLKKATNE-------- 105
sp|P0C0X8|CYC2_RHOSH  KKEADAHNIWAYLQQVAVRP------- 124
sp|P00091|CYC22_RHOPA ANEQQRKDVVAYLATLK---------- 139
sp|Q93VA3|CYC6_ARATH_ LQDEEIKLLAEFVKFQADQGWPTVSTD 175
                       :. :   :  ::

Very bad alignment.

Visualizing and manipulating alignments with JalView

"Bootstrap" tree (based on distances computed from the MSA)

(
(
(
sp|P99999|CYC_HUMAN:0.06522,
sp|P00004|CYC_HORSE:0.04907)
1000:0.24564,
sp|P0C0X8|CYC2_RHOSH:0.33664)
539:0.01376,
sp|P00091|CYC22_RHOPA:0.32266,
sp|Q93VA3|CYC6_ARATH_:0.53865)TRICHOTOMY;

NB. It is not usually a good tree for inferring phylogenetic relationships.
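The sum-of-pairs score introduced in the scoring slide can be sketched in a few lines of Python. This is only a minimal illustration: the function names and the match/mismatch/gap values (+1, -1, -2) are assumptions made here, whereas a real aligner would score pairs with a substitution matrix (BLOSUM or PAM) and affine gap penalties.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score the pairwise alignment induced by two rows of an MSA.

    The match/mismatch/gap values are illustrative assumptions.
    """
    score = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # column is gap-only for this pair: not part of the induced alignment
        if x == "-" or y == "-":
            score += gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


def sum_of_pairs(rows, **scoring):
    """S = sum over all pairs i < j of S(A_i, A_j)."""
    return sum(pair_score(a, b, **scoring) for a, b in combinations(rows, 2))


# The three aligned sequences from the slide example.
msa = ["ASPTLPLSLA", "SS-TLPA--A", "SSPTLPA--A"]

print(pair_score(msa[0], msa[1]))  # S(1,2) = -3
print(pair_score(msa[0], msa[2]))  # S(1,3) = 0
print(pair_score(msa[1], msa[2]))  # S(2,3) = 5
print(sum_of_pairs(msa))           # S(1,2,3) = 2
```

Note that columns in which both sequences of a pair carry a gap are skipped, which is what "induced pair alignment" means: each pair is scored as if the other sequences were not there.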
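The entropy-based alternatives to the sum-of-pairs score can be sketched in the same way. In this sketch the logarithm is taken in base 2 so that the entropy is in the same units (bits) as log2 20; ignoring gap characters when estimating the frequencies is an assumption of this example, and the function names are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the residue frequencies in one column.

    Gap characters are ignored (an assumption of this sketch).
    """
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    n = len(residues)
    return -sum((c / n) * math.log2(c / n) for c in Counter(residues).values())


def information_content(column, alphabet_size=20):
    """I = log2(20) - S(p): high for conserved columns, low for variable ones."""
    return math.log2(alphabet_size) - column_entropy(column)


def total_entropy(rows):
    """Overall entropy of an alignment: sum of the column entropies."""
    return sum(column_entropy(col) for col in zip(*rows))


print(column_entropy("AS"))        # two equally frequent residues: 1.0 bit
print(information_content("SSS"))  # fully conserved column: log2(20) bits
```

Maximizing the summed information content and minimizing the overall entropy are the same objective, since the two differ only by the constant log2 20 per column.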
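The order in which a progressive method merges sequences can be illustrated with the A, B, C, D example. The sketch below uses a greedy average-linkage clustering on similarity scores, which is a simplification chosen for this example and not the tree-building method of ClustalW; the assignment of scores to pairs is read off the flattened Step 1 figure and should be treated as an assumption.

```python
from itertools import combinations


def average_similarity(c1, c2, sim):
    """Mean pairwise similarity between the members of two clusters."""
    values = [sim[frozenset((a, b))] for a in c1 for b in c2]
    return sum(values) / len(values)


def guide_order(names, sim):
    """Greedily merge the two most similar clusters until one remains.

    Returns the list of merges, i.e. the order of the progressive alignment.
    """
    clusters = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = max(
            combinations(clusters, 2),
            key=lambda pair: average_similarity(pair[0], pair[1], sim),
        )
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges


# Pairwise scores S from Step 1 (pair labels are an assumption, see above).
similarity = {
    frozenset("AB"): 70, frozenset("AD"): 60, frozenset("BD"): 95,
    frozenset("AC"): 100, frozenset("BC"): 70, frozenset("CD"): 80,
}

for left, right in guide_order("ABCD", similarity):
    print(left, "+", right)
```

With these scores the merges are A with C, then B with D, then the two pairs together, which matches the similarity tree ((A, C), (B, D)).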
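The profile-to-profile score is a double sum over residue pairs weighted by their frequencies in the two columns. The sketch below assumes a toy substitution function (+2 for identical residues, -1 otherwise) as a stand-in for BLOSUM or PAM, and represents each profile position as a dictionary holding only the residues that actually occur; both choices, and the function names, are assumptions of this example.

```python
from collections import Counter


def column_profile(column):
    """Residue frequencies of one alignment column (gaps ignored)."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return {}
    n = len(residues)
    return {r: c / n for r, c in Counter(residues).items()}


def toy_matrix(a, b):
    """Hypothetical stand-in for a BLOSUM/PAM lookup M(r_m, r_k)."""
    return 2 if a == b else -1


def profile_score(p1, p2, matrix=toy_matrix):
    """Score(i, j) = sum_m sum_k P1_i(r_m) * P2_j(r_k) * M(r_m, r_k).

    Residues with zero frequency contribute nothing, so only the
    residues present in each profile are visited.
    """
    return sum(
        f1 * f2 * matrix(r1, r2)
        for r1, f1 in p1.items()
        for r2, f2 in p2.items()
    )


p1 = column_profile("AASS")  # {'A': 0.5, 'S': 0.5}
p2 = column_profile("AA")    # {'A': 1.0}
print(profile_score(p1, p2))  # 0.5*1.0*2 + 0.5*1.0*(-1) = 0.5
```

A function like `profile_score` would take the place of the residue-residue score inside a Needleman-Wunsch or Smith-Waterman recursion when two sub-alignments are merged.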
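The pairwise scores reported in the ClustalW result summary can be collected programmatically, for example to build a similarity matrix. This is a small sketch that assumes the log lines have exactly the form shown in the output above ("Sequences (1:2) Aligned. Score: 88"); the function name is hypothetical.

```python
import re

# Matches lines such as "Sequences (1:2) Aligned. Score: 88".
PAIR_RE = re.compile(r"Sequences \((\d+):(\d+)\) Aligned\. Score:\s*(\d+)")


def parse_pairwise_scores(log_text):
    """Return {(i, j): score} for every pairwise line found in the log."""
    return {
        (int(i), int(j)): int(score)
        for i, j, score in PAIR_RE.findall(log_text)
    }


log = """Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 88
Sequences (1:3) Aligned. Score: 30
Sequences (3:4) Aligned. Score: 36
Sequences (4:5) Aligned. Score: 3
"""

scores = parse_pairwise_scores(log)
best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])  # (1, 2) 88: human and horse cytochrome c
```

The highest-scoring pair is the one a progressive method aligns first, which is why sequences 1 and 2 form the first group in the guide tree.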