T-Coffee Documentation Release Version 13.45.47.Aba98c5

Total Page:16

File Type:pdf, Size:1020Kb

T-Coffee Documentation Release Version 13.45.47.Aba98c5 T-Coffee Documentation Release Version_13.45.47.aba98c5 Cedric Notredame Aug 31, 2021 Contents 1 T-Coffee Installation 3 1.1 Installation................................................3 1.1.1 Unix/Linux Binaries......................................4 1.1.2 MacOS Binaries - Updated...................................4 1.1.3 Installation From Source/Binaries downloader (Mac OSX/Linux)...............4 1.2 Template based modes: PSI/TM-Coffee and Expresso.........................5 1.2.1 Why do I need BLAST with T-Coffee?.............................6 1.2.2 Using a BLAST local version on Unix.............................6 1.2.3 Using the EBI BLAST client..................................6 1.2.4 Using the NCBI BLAST client.................................7 1.2.5 Using another client.......................................7 1.3 Troubleshooting.............................................7 1.3.1 Third party packages......................................7 1.3.2 M-Coffee parameters......................................9 1.3.3 Structural modes (using PDB)................................. 10 1.3.4 R-Coffee associated packages................................. 10 2 Quick Start Regressive Algorithm 11 2.1 Introduction............................................... 11 2.2 Installation from source......................................... 12 2.3 Examples................................................. 12 2.3.1 Fast and accurate........................................ 12 2.3.2 Slower and more accurate.................................... 12 2.3.3 Very Fast............................................ 13 2.4 Regressive flags............................................. 13 3 Quick Start Unistrap 15 3.1 Introduction............................................... 15 3.2 Examples................................................. 16 3.2.1 Produce bootsrap replicates by shuffling sequences and guide tree sister nodes........ 16 3.2.2 Produce bootsrap replicates by shuffling the guide tree sister nodes.............. 16 3.2.3 Get list of supported methods.................................. 16 3.3 unistrap flags............................................... 17 4 Quick Start T-Coffee 19 4.1 Basic Command Lines (or modes).................................... 19 4.1.1 Protein sequences........................................ 19 i 4.1.2 DNA sequences......................................... 20 4.1.3 RNA sequences......................................... 20 4.2 Brief Overview of T-Coffee Tools.................................... 21 4.2.1 Alignment methods....................................... 21 4.2.1.1 T-Coffee........................................ 21 4.2.1.2 M-Coffee....................................... 22 4.2.1.3 Expresso........................................ 22 4.2.1.4 R-Coffee........................................ 23 4.2.1.5 Pro-Coffee....................................... 23 4.2.2 Evaluation tools......................................... 24 4.2.2.1 TCS (MSA evaluation based on consistency)..................... 24 4.2.2.2 iRMSD/APDB (MSA structural evaluation)..................... 24 4.2.2.3 STRIKE (single structure MSA evaluation)...................... 24 4.2.2.4 T-RMSD (structural clustering)............................ 25 4.3 Tutorial (Practical Examples)...................................... 25 4.3.1 Introduction........................................... 25 4.3.2 Materials............................................ 25 4.3.3 Procedures........................................... 26 5 T-Coffee Main Documentation 27 5.1 Before You Start. ............................................ 27 5.1.1 Foreword............................................ 27 5.1.2 Prerequisite for using T-Coffee................................. 27 5.1.3 Let’s have a try. ......................................... 28 5.2 What is T-Coffee ?............................................ 28 5.2.1 What is T-Coffee?........................................ 28 5.2.1.1 What does it do?.................................... 28 5.2.1.2 What can it align?................................... 28 5.2.1.3 How can I use it?................................... 29 5.2.1.4 Is there an online webserver?............................. 29 5.2.1.5 Is T-Coffee different from ClustalW?......................... 29 5.2.1.6 Is T-Coffee very accurate?............................... 29 5.2.2 What T-Coffee can and cannot do for you . .......................... 29 5.2.2.1 What T-Coffee can’t do................................ 29 5.2.2.2 What T-Coffee can do................................. 29 5.2.3 How does T-Coffee alignment works?............................. 30 5.3 Preparing Your Data........................................... 31 5.3.1 The reformatting utility: seq_reformat............................. 31 5.3.1.1 General introduction.................................. 31 5.3.1.2 Modification options.................................. 32 5.3.1.3 Using a “cache” file.................................. 32 5.3.1.3.1 What is a cache in T-Coffee?........................ 32 5.3.1.3.2 Preparing a sequence/alignment cache................... 32 5.3.1.3.3 Preparing a library cache.......................... 33 5.3.2 Modifying the format of your data............................... 34 5.3.2.1 Keeping/Protecting your sequence names....................... 34 5.3.2.2 Changing the sequence format............................ 34 5.3.2.3 Changing the case................................... 35 5.3.2.3.1 Changing the case of your sequences.................... 35 5.3.2.3.2 Changing the case of specific residues................... 35 5.3.2.3.3 Changing the case with a cache....................... 35 5.3.2.4 Coloring/Editing residues in an alignment...................... 36 5.3.2.4.1 Changing the default colors......................... 36 5.3.2.4.2 Coloring specific types of residues/nucleic acids.............. 36 ii 5.3.2.4.3 Coloring a specific residue of a specific sequence............. 36 5.3.2.4.4 Coloring according to the conservation................... 37 5.3.2.4.5 Coloring an alignment using a cache.................... 37 5.3.3 Modifying the data itself. .................................... 37 5.3.3.1 Modifiying sequences in your dataset......................... 37 5.3.3.1.1 Converting residues............................. 37 5.3.3.1.2 Extracting sequences according to a pattern................ 38 5.3.3.1.3 Extracting/Removing specific sequences by names............ 38 5.3.3.1.4 Extracting the most informative sequences................. 39 5.3.3.1.5 Extracting/Removing sequences with the % identity............ 39 5.3.3.1.6 Chaining important sequences....................... 41 5.3.3.2 Modifying columns/blocks in your dataset...................... 41 5.3.3.2.1 Removing gapped columns......................... 41 5.3.3.2.2 Extracting specific columns......................... 41 5.3.3.2.3 Extracting entire blocks........................... 42 5.3.3.2.4 Concatenating blocks or MSAs....................... 43 5.3.4 Manipulating DNA sequences................................. 43 5.3.4.1 Translating DNA sequences into protein sequences.................. 43 5.3.4.2 Back-translation with the bona fide DNA sequences................. 43 5.3.4.3 Finding the bona fide sequences for the back-translation............... 43 5.3.5 Manipulating RNA Sequences................................. 43 5.3.5.1 Producing a Stockholm output: adding predicted secondary structures....... 43 5.3.5.1.1 Producing/Adding a consensus structure.................. 43 5.3.5.1.2 Adding a precomputed consensus structure to an alignment........ 44 5.3.5.2 Analyzing a RNAalifold secondary structure prediction............... 44 5.3.5.2.1 Visualizing compensatory mutations.................... 44 5.3.5.2.2 Analyzing matching columns........................ 45 5.3.5.3 Comparing alternative folds.............................. 45 5.3.6 Phylogenetic Trees Manipulation................................ 45 5.3.6.1 Producing phylogenetic trees............................. 45 5.3.6.2 Comparing two phylogenetic trees.......................... 46 5.3.6.3 Scanning phylogenetic trees.............................. 47 5.3.6.4 Pruning phylogenetic trees.............................. 47 5.3.6.5 Tree Reformatting for MAFFT............................ 47 5.3.7 Manipulating structure files (PDB)............................... 48 5.3.7.1 Extracting a structure................................. 48 5.3.7.2 Adapting extract_from_pdb to your own environment................ 48 5.4 Aligning Your Sequences........................................ 49 5.4.1 General comments on alignments and aligners......................... 49 5.4.1.1 What is a good alignment?.............................. 49 5.4.1.2 The main methods and their scope.......................... 49 5.4.1.2.1 ClustalW is really everywhere. ....................... 49 5.4.1.2.2 MAFFT/MUSCLE to align big datasets.................. 50 5.4.1.2.3 T-Coffee/ProbCons, slow but accurate !!!.................. 50 5.4.1.3 Choosing the right package (without flipping a coin !)................ 50 5.4.2 Protein Sequences........................................ 51 5.4.2.1 General considerations................................ 51 5.4.2.2 Computing a simple MSA (default T-Coffee)..................... 52 5.4.2.3 Aligning multiple datasets/Combining multiple MSAs................ 52 5.4.2.4 Estimating the diversity in your alignment...................... 53 5.4.2.5 Comparing alternative alignments........................... 53 5.4.2.6 Comparing subsets of alternative alignments....................
Recommended publications
  • Multiple Sequence Alignment: a Major Challenge to Large-Scale Phylogenetics
    Multiple sequence alignment: a major challenge to large-scale phylogenetics November 18, 2010 · Tree of Life Kevin Liu1, C. Randal Linder2, Tandy Warnow3 1 Postdoctoral Research Associate, Austin, TX, 2 Associate Professor, Austin, TX, 3 Professor of Computer Science, Department of Computer Science, University of Texas at Austin Liu K, Linder CR, Warnow T. Multiple sequence alignment: a major challenge to large-scale phylogenetics. PLOS Currents Tree of Life. 2010 Nov 18 [last modified: 2012 Apr 23]. Edition 1. doi: 10.1371/currents.RRN1198. Abstract Over the last decade, dramatic advances have been made in developing methods for large-scale phylogeny estimation, so that it is now feasible for investigators with moderate computational resources to obtain reasonable solutions to maximum likelihood and maximum parsimony, even for datasets with a few thousand sequences. There has also been progress on developing methods for multiple sequence alignment, so that greater alignment accuracy (and subsequent improvement in phylogenetic accuracy) is now possible through automated methods. However, these methods have not been tested under conditions that reflect properties of datasets confronted by large-scale phylogenetic estimation projects. In this paper we report on a study that compares several alignment methods on a benchmark collection of nucleotide sequence datasets of up to 78,132 sequences. We show that as the number of sequences increases, the number of alignment methods that can analyze the datasets decreases. Furthermore, the most accurate alignment methods are unable to analyze the very largest datasets we studied, so that only moderately accurate alignment methods can be used on the largest datasets. As a result, alignments computed for large datasets have relatively large error rates, and maximum likelihood phylogenies computed on these alignments also have high error rates.
    [Show full text]
  • Comparative Analysis of Multiple Sequence Alignment Tools
    I.J. Information Technology and Computer Science, 2018, 8, 24-30 Published Online August 2018 in MECS (http://www.mecs-press.org/) DOI: 10.5815/ijitcs.2018.08.04 Comparative Analysis of Multiple Sequence Alignment Tools Eman M. Mohamed Faculty of Computers and Information, Menoufia University, Egypt E-mail: [email protected]. Hamdy M. Mousa, Arabi E. keshk Faculty of Computers and Information, Menoufia University, Egypt E-mail: [email protected], [email protected]. Received: 24 April 2018; Accepted: 07 July 2018; Published: 08 August 2018 Abstract—The perfect alignment between three or more global alignment algorithm built-in dynamic sequences of Protein, RNA or DNA is a very difficult programming technique [1]. This algorithm maximizes task in bioinformatics. There are many techniques for the number of amino acid matches and minimizes the alignment multiple sequences. Many techniques number of required gaps to finds globally optimal maximize speed and do not concern with the accuracy of alignment. Local alignments are more useful for aligning the resulting alignment. Likewise, many techniques sub-regions of the sequences, whereas local alignment maximize accuracy and do not concern with the speed. maximizes sub-regions similarity alignment. One of the Reducing memory and execution time requirements and most known of Local alignment is Smith-Waterman increasing the accuracy of multiple sequence alignment algorithm [2]. on large-scale datasets are the vital goal of any technique. The paper introduces the comparative analysis of the Table 1. Pairwise vs. multiple sequence alignment most well-known programs (CLUSTAL-OMEGA, PSA MSA MAFFT, BROBCONS, KALIGN, RETALIGN, and Compare two biological Compare more than two MUSCLE).
    [Show full text]
  • How to Generate a Publication-Quality Multiple Sequence Alignment (Thomas Weimbs, University of California Santa Barbara, 11/2012)
    Tutorial: How to generate a publication-quality multiple sequence alignment (Thomas Weimbs, University of California Santa Barbara, 11/2012) 1) Get your sequences in FASTA format: • Go to the NCBI website; find your sequences and display them in FASTA format. Each sequence should look like this (http://www.ncbi.nlm.nih.gov/protein/6678177?report=fasta): >gi|6678177|ref|NP_033320.1| syntaxin-4 [Mus musculus] MRDRTHELRQGDNISDDEDEVRVALVVHSGAARLGSPDDEFFQKVQTIRQTMAKLESKVRELEKQQVTIL ATPLPEESMKQGLQNLREEIKQLGREVRAQLKAIEPQKEEADENYNSVNTRMKKTQHGVLSQQFVELINK CNSMQSEYREKNVERIRRQLKITNAGMVSDEELEQMLDSGQSEVFVSNILKDTQVTRQALNEISARHSEI QQLERSIRELHEIFTFLATEVEMQGEMINRIEKNILSSADYVERGQEHVKIALENQKKARKKKVMIAICV SVTVLILAVIIGITITVG 2) In a text editor, paste all your sequences together (in the order that you would like them to appear in the end). It should look like this: >gi|6678177|ref|NP_033320.1| syntaxin-4 [Mus musculus] MRDRTHELRQGDNISDDEDEVRVALVVHSGAARLGSPDDEFFQKVQTIRQTMAKLESKVRELEKQQVTIL ATPLPEESMKQGLQNLREEIKQLGREVRAQLKAIEPQKEEADENYNSVNTRMKKTQHGVLSQQFVELINK CNSMQSEYREKNVERIRRQLKITNAGMVSDEELEQMLDSGQSEVFVSNILKDTQVTRQALNEISARHSEI QQLERSIRELHEIFTFLATEVEMQGEMINRIEKNILSSADYVERGQEHVKIALENQKKARKKKVMIAICV SVTVLILAVIIGITITVG >gi|151554658|gb|AAI47965.1| STX3 protein [Bos taurus] MKDRLEQLKAKQLTQDDDTDEVEIAVDNTAFMDEFFSEIEETRVNIDKISEHVEEAKRLYSVILSAPIPE PKTKDDLEQLTTEIKKRANNVRNKLKSMERHIEEDEVQSSADLRIRKSQHSVLSRKFVEVMTKYNEAQVD FRERSKGRIQRQLEITGKKTTDEELEEMLESGNPAIFTSGIIDSQISKQALSEIEGRHKDIVRLESSIKE LHDMFMDIAMLVENQGEMLDNIELNVMHTVDHVEKAREETKRAVKYQGQARKKLVIIIVIVVVLLGILAL IIGLSVGLK
    [Show full text]
  • Coffeescript Accelerated Javascript Development.Pdf
    Download from Wow! eBook <www.wowebook.com> What readers are saying about CoffeeScript: Accelerated JavaScript Development It’s hard to imagine a new web application today that doesn’t make heavy use of JavaScript, but if you’re used to something like Ruby, it feels like a significant step down to deal with JavaScript, more of a chore than a joy. Enter CoffeeScript: a pre-compiler that removes all the unnecessary verbosity of JavaScript and simply makes it a pleasure to write and read. Go, go, Coffee! This book is a great introduction to the world of CoffeeScript. ➤ David Heinemeier Hansson Creator, Rails Just like CoffeeScript itself, Trevor gets straight to the point and shows you the benefits of CoffeeScript and how to write concise, clear CoffeeScript code. ➤ Scott Leberknight Chief Architect, Near Infinity Though CoffeeScript is a new language, you can already find it almost everywhere. This book will show you just how powerful and fun CoffeeScript can be. ➤ Stan Angeloff Managing Director, PSP WebTech Bulgaria Download from Wow! eBook <www.wowebook.com> This book helps readers become better JavaScripters in the process of learning CoffeeScript. What’s more, it’s a blast to read, especially if you are new to Coffee- Script and ready to learn. ➤ Brendan Eich Creator, JavaScript CoffeeScript may turn out to be one of the great innovations in web application development; since I first discovered it, I’ve never had to write a line of pure JavaScript. I hope the readers of this wonderful book will be able to say the same. ➤ Dr. Nic Williams CEO/Founder, Mocra CoffeeScript: Accelerated JavaScript Development is an excellent guide to Coffee- Script from one of the community’s most esteemed members.
    [Show full text]
  • Coffeescript Accelerated Javascript Development, Second Edition
    Extracted from: CoffeeScript Accelerated JavaScript Development, Second Edition This PDF file contains pages extracted from CoffeeScript, published by the Prag- matic Bookshelf. For more information or to purchase a paperback or PDF copy, please visit http://www.pragprog.com. Note: This extract contains some colored text (particularly in code listing). This is available only in online versions of the books. The printed versions are black and white. Pagination might vary between the online and printed versions; the content is otherwise identical. Copyright © 2015 The Pragmatic Programmers, LLC. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form, or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior consent of the publisher. The Pragmatic Bookshelf Dallas, Texas • Raleigh, North Carolina CoffeeScript Accelerated JavaScript Development, Second Edition Trevor Burnham The Pragmatic Bookshelf Dallas, Texas • Raleigh, North Carolina Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and The Pragmatic Programmers, LLC was aware of a trademark claim, the designations have been printed in initial capital letters or in all capitals. The Pragmatic Starter Kit, The Pragmatic Programmer, Pragmatic Programming, Pragmatic Bookshelf, PragProg and the linking g device are trade- marks of The Pragmatic Programmers, LLC. Every precaution was taken in the preparation of this book. However, the publisher assumes no responsibility for errors or omissions, or for damages that may result from the use of information (including program listings) contained herein. Our Pragmatic courses, workshops, and other products can help you and your team create better software and have more fun.
    [Show full text]
  • Ple Sequence Alignment Methods: Evidence from Data
    Mul$ple sequence alignment methods: evidence from data Tandy Warnow Alignment Error/Accuracy • SPFN: percentage of homologies in the true alignment that are not recovered (false negave homologies) • SPFP: percentage of homologies in the es$mated alignment that are false (false posi$ve homologies) • TC: total number of columns correctly recovered • SP-score: percentage of homologies in the true alignment that are recovered • Pairs score: 1-(avg of SP-FN and SP-FP) Benchmarks • Simulaons: can control everything, and true alignment is not disputed – Different simulators • Biological: can’t control anything, and reference alignment might not be true alignment – BAliBASE, HomFam, Prefab – CRW (Comparave Ribosomal Website) Alignment Methods (Sample) • Clustal-Omega • MAFFT • Muscle • Opal • Prank/Pagan • Probcons Co-es$maon of trees and alignments • Bali-Phy and Alifritz (stas$cal co-es$maon) • SATe-1, SATe-2, and PASTA (divide-and-conquer co- es$maon) • POY and Beetle (treelength op$mizaon) Other Criteria • Tree topology error • Tree branch length error • Gap length distribu$on • Inser$on/dele$on rao • Alignment length • Number of indels How does the guide tree impact accuracy? • Does improving the accuracy of the guide tree help? • Do all alignment methods respond iden$cally? (Is the same guide tree good for all methods?) • Do the default sengs for the guide tree work well? Alignment criteria • Does the relave performance of methods depend on the alignment criterion? • Which alignment criteria are predic$ve of tree accuracy? • How should we design MSA methods to produce best accuracy? Choice of best MSA method • Does it depend on type of data (DNA or amino acids?) • Does it depend on rate of evolu$on? • Does it depend on gap length distribu$on? • Does it depend on existence of fragments? Katoh and Standley .
    [Show full text]
  • Performance Evaluation of Leading Protein Multiple Sequence Alignment Methods
    International Journal of Engineering and Advanced Technology (IJEAT) ISSN: 2249 – 8958, Volume-9 Issue-1, October 2019 Performance Evaluation of Leading Protein Multiple Sequence Alignment Methods Arunima Mishra, B. K. Tripathi, S. S. Soam MSA is a well-known method of alignment of three or more Abstract: Protein Multiple sequence alignment (MSA) is a biological sequences. Multiple sequence alignment is a very process, that helps in alignment of more than two protein intricate problem, therefore, computation of exact MSA is sequences to establish an evolutionary relationship between the only feasible for the very small number of sequences which is sequences. As part of Protein MSA, the biological sequences are not practical in real situations. Dynamic programming as used aligned in a way to identify maximum similarities. Over time the sequencing technologies are becoming more sophisticated and in pairwise sequence method is impractical for a large number hence the volume of biological data generated is increasing at an of sequences while performing MSA and therefore the enormous rate. This increase in volume of data poses a challenge heuristic algorithms with approximate approaches [7] have to the existing methods used to perform effective MSA as with the been proved more successful. Generally, various biological increase in data volume the computational complexities also sequences are organized into a two-dimensional array such increases and the speed to process decreases. The accuracy of that the residues in each column are homologous or having the MSA is another factor critically important as many bioinformatics same functionality. Many MSA methods were developed over inferences are dependent on the output of MSA.
    [Show full text]
  • Determination of Optimal Parameters of MAFFT Program Based on Balibase3.0 Database
    Long et al. SpringerPlus (2016) 5:736 DOI 10.1186/s40064-016-2526-5 RESEARCH Open Access Determination of optimal parameters of MAFFT program based on BAliBASE3.0 database HaiXia Long1, ManZhi Li2* and HaiYan Fu1 *Correspondence: myresearch_hainnu@163. Abstract com Background: Multiple sequence alignment (MSA) is one of the most important 2 School of Mathematics and Statistics, Hainan Normal research contents in bioinformatics. A number of MSA programs have emerged. The University, Haikou 571158, accuracy of MSA programs highly depends on the parameters setting, mainly including HaiNan, China gap open penalties (GOP), gap extension penalties (GEP) and substitution matrix (SM). Full list of author information is available at the end of the This research tries to obtain the optimal GOP, GEP and SM rather than MAFFT default article parameters. Results: The paper discusses the MAFFT program benchmarked on BAliBASE3.0 data- base, and the optimal parameters of MAFFT program are obtained, which are better than the default parameters of CLUSTALW and MAFFT program. Conclusions: The optimal parameters can improve the results of multiple sequence alignment, which is feasible and efficient. Keywords: Multiple sequence alignment, Gap open penalties, Gap extension penalties, Substitution matrix, MAFFT program, Default parameters Background Multiple sequence alignment (MSA), one of the most basic bioinformatics tool, has wide applications in sequence analysis, gene recognition, protein structure prediction, and phylogenetic tree reconstruction, etc. MSA computation is a NP-complete problem (Lathrop 1995), whose time and space complexity have sharp increase while the length and the number of sequences are increasing. At present, many scholars have developed open source online alignment tools, such as CLUSTALW, T-COFFEE, MAFFT, (Thompson et al.
    [Show full text]
  • The Art of Multiple Sequence Alignment in R
    The Art of Multiple Sequence Alignment in R Erik S. Wright May 19, 2021 Contents 1 Introduction 1 2 Alignment Speed 2 3 Alignment Accuracy 4 4 Recommendations for optimal performance 7 5 Single Gene Alignment 8 5.1 Example: Protein coding sequences . 8 5.2 Example: Non-coding RNA sequences . 9 5.3 Example: Aligning two aligned sequence sets . 9 6 Advanced Options & Features 10 6.1 Example: Building a Guide Tree . 10 6.2 Example: Post-processing an existing multiple alignment . 12 7 Aligning Homologous Regions of Multiple Genomes 12 8 Session Information 15 1 Introduction This document is intended to illustrate the art of multiple sequence alignment in R using DECIPHER. Even though its beauty is often con- cealed, multiple sequence alignment is a form of art in more ways than one. Take a look at Figure 1 for an illustration of what is happening behind the scenes during multiple sequence alignment. The practice of sequence alignment is one that requires a degree of skill, and it is that art which this vignette intends to convey. It is simply not enough to \plug" sequences into a multiple sequence aligner and blindly trust the result. An appreciation for the art as well a careful consideration of the results are required. What really is multiple sequence alignment, and is there a single cor- rect alignment? Generally speaking, alignment seeks to perform the act of taking multiple divergent biological sequences of the same \type" and Figure 1: The art of multiple se- fitting them to a form that reflects some shared quality.
    [Show full text]
  • Multiple Sequence Alignment Based on Profile Alignment of Intermediate
    Multiple Sequence Alignment Based on Profile Alignment of Intermediate Sequences Yue Lu1 and Sing-Hoi Sze1,2 1 Department of Biochemistry & Biophysics 2 Department of Computer Science, Texas A&M University, College Station, TX 77843, USA Abstract. Despite considerable efforts, it remains difficult to obtain accurate multiple sequence alignments. By using additional hits from database search of the input sequences, a few strategies have been pro- posed to significantly improve alignment accuracy, including the con- struction of profiles from the hits while performing profile alignment, the inclusion of high scoring hits into the input sequences, the use of interme- diate sequence search to link distant homologs, and the use of secondary structure information. We develop an algorithm that integrates these strategies to further improve alignment accuracy by modifying the pair- HMM approach in ProbCons to incorporate profiles of intermediate se- quences from database search and utilize secondary structure predictions as in SPEM. We test our algorithm on a few sets of benchmark multi- ple alignments, including BAliBASE, HOMSTRAD, PREFAB and SAB- mark, and show that it significantly outperforms MAFFT and ProbCons, which are among the best multiple alignment algorithms that do not uti- lize additional information, and SPEM, which is among the best multiple alignment algorithms that utilize additional hits from database search. The improvement in accuracy over SPEM can be as much as 5 to 10% when aligning divergent sequences. A software program that implements this approach (ISPAlign) is at http://faculty.cs.tamu.edu/shsze/ispalign. 1 Introduction Although many algorithms have been proposed for multiple sequence alignment (Thompson et al.
    [Show full text]
  • Introduction to Linux for Bioinformatics – Part II Paul Stothard, 2006-09-20
    Introduction to Linux for bioinformatics – part II Paul Stothard, 2006-09-20 In the previous guide you learned how to log in to a Linux account, and you were introduced to some basic Linux commands. This section covers some more advanced commands and features of the Linux operating system. It also introduces some command-line bioinformatics programs. One important aspect of using a Linux system from a Windows or Mac environment that was not discussed in the previous section is how to transfer files between computers. For example, you may have a collection of sequence records on your Windows desktop that you wish to analyze using a Linux command-line program. Alternatively, you may want to transfer some sequence analysis results from a Linux system to your Mac so that you can add them to a PowerPoint presentation. Transferring files between Mac OS X and Linux Recall that Mac OS X includes a Terminal application (located in the Applications >> Utilities folder), which can be used to log in to other systems. This terminal can also be used to transfer files, thanks to the scp command. Try transferring a file from your Mac to your Linux account using the Terminal application: 1. Launch the Terminal program. 2. Instead of logging in to your Linux account, use the same basic commands you learned in the previous section (pwd, ls, and cd) to navigate your Mac file system. 3. Switch to your home directory on the Mac using the command cd ~ 4. Create a text file containing your home directory listing using ls -l > myfiles.txt 5.
    [Show full text]
  • Multiple Sequence Alignments Iain M Wallace, Gordon Blackshields and Desmond G Higgins
    Multiple sequence alignments Iain M Wallace, Gordon Blackshields and Desmond G Higgins Multiple sequence alignments are very widely used in all areas ‘progressive alignment’. This was described in various of DNA and protein sequence analysis. The main methods ways by several different groups and resulted in a series of that are still in use are based on ‘progressive alignment’ and programs in the mid to late 1980s that are still in use today date from the mid to late 1980s. Recently, some dramatic [8–12]. Progressive alignment allows large alignments of improvements have been made to the methodology with distantly related sequences to be constructed quickly and respect either to speed and capacity to deal with large numbers simply. It is based on building the full alignment up of sequences or to accuracy. There have also been some progressively, using the branching order of a quick practical advances concerning how to combine three- approximate tree (called the guide tree) to guide the dimensional structural information with primary sequences alignments. It is implemented in the most widely used to give more accurate alignments, when structures are programs (ClustalW [13] and ClustalX [14]), but is also available. used as the optimiser for other programs, such as T-Coffee [15]. The latter is unusual in that it uses the Addresses maximum weight trace [16] or ‘Coffee’ [17] objective The Conway Institute of Biomolecular and Biomedical Research, function, rather than a more conventional dynamic pro- University College Dublin, Ireland gramming sequence distance score. T-Coffee is of great Corresponding author: Higgins, Desmond G ([email protected]) interest not only because of the way it allows heteroge- neous data to be merged in alignments (see 3D-Coffee below) but also as a precursor to the probabilistic-based Current Opinion in Structural Biology 2005, 15:261–266 program PROBCONS [18], which is the most accurate This review comes from a themed issue on method available.
    [Show full text]