Multiple Sequence Alignment Methods: Evidence from Data

Tandy Warnow

Alignment Error/Accuracy
• SP-FN: percentage of homologies in the true alignment that are not recovered (false negative homologies)
• SP-FP: percentage of homologies in the estimated alignment that are false (false positive homologies)
• TC: total number of columns correctly recovered
• SP-score: percentage of homologies in the true alignment that are recovered
• Pairs score: 1 − (average of SP-FN and SP-FP)

Benchmarks
• Simulations: can control everything, and true alignment is not disputed
  – Different simulators
• Biological: can’t control anything, and reference alignment might not be true alignment
  – BAliBASE, HomFam, Prefab
  – CRW (Comparative RNA Website)

Alignment Methods (Sample)
• Clustal-Omega
• MAFFT
• Muscle
• Opal
• Prank/Pagan
• Probcons

Co-estimation of trees and alignments
• Bali-Phy and Alifritz (statistical co-estimation)
• SATe-1, SATe-2, and PASTA (divide-and-conquer co-estimation)
• POY and Beetle (treelength optimization)

Other Criteria
• Tree topology error
• Tree branch length error
• Gap length distribution
• Insertion/deletion ratio
• Alignment length
• Number of indels

How does the guide tree impact accuracy?
• Does improving the accuracy of the guide tree help?
• Do all alignment methods respond identically? (Is the same guide tree good for all methods?)
• Do the default settings for the guide tree work well?

Alignment criteria
• Does the relative performance of methods depend on the alignment criterion?
• Which alignment criteria are predictive of tree accuracy?
• How should we design MSA methods to produce best accuracy?

Choice of best MSA method
• Does it depend on type of data (DNA or amino acids)?
• Does it depend on rate of evolution?
• Does it depend on gap length distribution?
• Does it depend on existence of fragments?

Katoh and Standley. doi:10.1093/molbev/mst010. MBE

Table 2.
Comparison of Different Options Using the 16S.B.ALL Data Set (Mirarab et al. 2012).

Data    Method                                               Accuracy  CPU time       Actual time (a)
Case 1  mafft --multipair --addfragments frags existingmsa   0.9969    6.67 days      18.3 h
        mafft --6merpair --addfragments frags existingmsa    0.9949    3.76 h         36.2 min
        mafft --localpair --add frags existingmsa            0.9707    23.4 days (b)  2.43 days (b)
        mafft --6merpair --add frags existingmsa             0.9604    1.32 h         1.44 h
        profile alignment                                    0.2779    15.5 h         1.60 h
Case 2  mafft --6merpair --addfragments frags existingmsa    0.9969    4.54 h         33.8 min
Case 3  mafft --6merpair --addfragments frags existingmsa    0.9949    1.79 days      5.91 h

NOTE: The estimated alignments were compared with the CRW alignment to measure the accuracy (the number of correctly aligned letters / the number of aligned letters in the CRW alignment). Calculations were performed on a Linux PC with 2.67 GHz Intel Xeon E7-8837/256 GB RAM (for the case marked with superscript “b”), or on a Linux PC with 3.47 GHz Intel Xeon X5690/48 GB RAM (for the other cases).
Case 1: 13,822 sequences in the existing alignment × 13,821 fragments; Case 2: 1,000 sequences in the existing alignment × 138,210 fragments; Case 3: 13,822 sequences in the existing alignment × 138,210 fragments.
(a) Wall-clock time with 10 cores. Command-line argument for parallel processing is --thread 10.
(b) Full command-line options are as follows: mafft --localpair --weighti 0 --add frags existingmsa.

From Katoh and Standley, 2013 (dealing with fragmentary sequences). Mol. Biol. Evol. 30(4):772–780, doi:10.1093/molbev/mst010

... to each other. Even if a more computationally expensive (and usually more accurate) method, L-INS-i, is applied (CPU time = 98 h), the alignment is still obviously incorrect (fig. 2B).

Two-step strategies can solve this type of problem. That is, a set of full-length sequences taken from databases are first aligned to build a backbone MSA, and then the new ITS1 and ITS2 sequences are added into this backbone MSA, using the --addfragments option.

    Step 1: mafft --auto full_length_sequences > backbone_msa
    Step 2: mafft --addfragments new_sequences backbone_msa > output

The second command is equivalent to

    mafft --multipair --addfragments new_sequences backbone_msa > output

in which Dynamic Programming (DP) is used to compare the distances between every new sequence and every sequence in the backbone MSA (--multipair is selected by default).

    mafft --6merpair --addfragments new_sequences backbone_msa > output

where distances are rapidly estimated using the number of shared 6mers, instead of DP.

The result of the latter option (--6merpair --addfragments) is shown in fig. 2D and E. The difference between D and E is just in the order of sequences; the sequences were reordered according to similarity using the --reorder option in E. In this alignment, ITS1 and ITS2 are clearly separated and aligned to appropriate positions in the full-length alignment. Moreover, this strategy is computationally much less expensive (CPU time = 15 min [first step] + 1.5 min [second step]) than the full application of L-INS-i (CPU time = 98 h). The former option (--multipair --addfragments) also returns a similar result to the latter (--6merpair) but is slower (CPU time = 48.6 min [second step]).

This case suggests that it is crucial to select a strategy appropriate to the problem of interest. The most time-consuming method, L-INS-i, is not always the most accurate one. The difficulty of this problem for standard approaches comes from the fact that ITS1 sequences and ITS2 sequences are not homologous to each other and most pairwise alignments are impossible. Because of these nonhomologous pairs, the distance matrix used for the guide tree calculation is not additive; the distances between ITS1 and full-length sequences and those between ITS2 and full-length sequences are close to zero, whereas the distances between ITS1 and ITS2 are quite large. In this situation, it is difficult for normal distance-based tree-building methods to give a reasonable tree. Moreover, in the alignment step, the objective function of the L-INS-i is affected by inappropriate pairwise alignment scores between ITS1 and ITS2. Such problems can be avoided by just ignoring the relationship between ITS1 and ITS2, as done in the --addfragments option.

In addition, a result of the second type of misuse of mafft-profile (discussed earlier) is shown in figure 2C. Some new sequences are correctly aligned but others are obviously incorrectly aligned (note that the order of sequences in fig. 2C is identical to that in fig. 2D). These misalignments are due to an incorrect assumption on phylogenetic placement of new sequences shown in figure 1C.

Test Case 2: Bacterial SSU rRNA

Another case is the 16S.B.ALL data set by Mirarab et al. (2012). It consists of an MSA of 13,822 bacterial SSU rRNA sequences, taken from the Gutell Comparative RNA Website (CRW) (Cannone et al. 2002) and 138,210 fragmentary sequences, which are originally included in the CRW alignment but ungapped and artificially truncated. In Katoh and Standley (2013), we used a subset (13,821 fragmentary sequences) prepared by Mirarab et al. (2012). In addition to this subset, here we use the full data set (138,210 fragmentary sequences), to examine the scalability. Suppose a situation where we already have a manually curated (or backbone) MSA and a newly determined set of many fragmentary sequences in a metagenomics project, and we need an entire MSA of them.

The first four lines in table 2 (case 1) show the performances of various options for such an analysis, with a relatively small data set (13,822 sequences in the existing ...

Important!
• Each method can be run in different ways, so you need to know the exact command used to be able to evaluate performance. (You also need to know the version number!)
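Since performance depends on the exact command, one way to keep the two-step strategy reproducible is to build each MAFFT command line in code and record it next to the result. The following is only a sketch under stated assumptions: the helper names (backbone_command, add_fragments_command, run_to_file) and the file names are hypothetical, and only the MAFFT options quoted in the excerpt and table above (--auto, --addfragments, --6merpair, --multipair, --reorder, --thread) are used.

```python
import shlex
import subprocess


def backbone_command(full_length_fasta):
    """Step 1: align the full-length sequences to build the backbone MSA."""
    return ["mafft", "--auto", str(full_length_fasta)]


def add_fragments_command(fragments_fasta, backbone_msa,
                          distance="6merpair", threads=None, reorder=False):
    """Step 2: add fragmentary sequences into an existing backbone MSA.

    distance is "6merpair" (shared 6mers, fast) or "multipair" (DP, slower).
    """
    if distance not in ("6merpair", "multipair"):
        raise ValueError("distance must be '6merpair' or 'multipair'")
    cmd = ["mafft", "--" + distance]
    if threads is not None:
        cmd += ["--thread", str(threads)]
    if reorder:
        cmd.append("--reorder")
    cmd += ["--addfragments", str(fragments_fasta), str(backbone_msa)]
    return cmd


def run_to_file(cmd, output_path):
    """Run a command and write its standard output (the alignment) to a file."""
    with open(output_path, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)


if __name__ == "__main__":
    # Only print the commands here; call run_to_file(...) to execute them
    # on a machine where MAFFT is installed.
    step1 = backbone_command("full_length_sequences")
    step2 = add_fragments_command("new_sequences", "backbone_msa", threads=10)
    print(shlex.join(step1), "> backbone_msa")
    print(shlex.join(step2), "> output")
```

Keeping the commands as argument lists rather than shell strings makes it straightforward to log the exact invocation alongside the MAFFT version used for a run.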
Clustal-Omega study
• Clustal-Omega (Sievers et al., Molecular Systems Biology 2011) is the latest in the Clustal family of MSA methods
• Clustal-Omega is designed primarily for amino acid alignment, but can be used on nucleotide datasets
• Alignment criterion: TC (column score)
• Datasets: biological with structural alignments

High-quality protein MSAs using Clustal Omega, F Sievers et al

... attention to gaps. These gap positions are not included in these tests as they tend not to be structurally conserved. Dialign (Morgenstern et al, 1998) does not use consistency or progressive alignment but is based on finding best local multiple alignments. FSA (Bradley et al, 2009) uses sampling of pairwise alignments and ‘sequence annealing’ and has been shown to deliver good nucleotide sequence alignments in the past.

The Prefab benchmark test results are shown in Table II. Here, the results are divided into five groups according to the percent identity of the sequences. The overall scores range from 53 to 73% of columns correct. The consistency-based programs MSAprobs, MAFFT L-INS-i, Probalign, Probcons and T-Coffee, are again the most accurate but with long run times. Clustal Omega is close to the consistency programs in accuracy but is much faster. There is then a gap to the faster progressive based programs of MUSCLE, MAFFT, Kalign (Lassmann and Sonnhammer, 2005) and Clustal W.

[Table II: Prefab results grouped by percent identity (%ID), with columns for total time (s) and consistency; table body not recoverable.]

Results from testing large alignments with up to 50000 sequences are given in Table III using HomFam. Here, each alignment is made up of a core of a Homstrad (Mizuguchi et al, 1998) structure-based alignment of at least five sequences. These sequences are then inserted into a test set of sequences from the corresponding, homologous, Pfam domain. This gives very large sets of sequences to be aligned but the testing is only carried out on the sequences with known structures. Only some programs are able to deliver alignments at all, with data sets of this size. We restricted the comparisons to Clustal Omega, MAFFT, MUSCLE and Kalign.
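The column score (TC) used as the criterion in this study, and the homology-based measures listed under Alignment Error/Accuracy, can be sketched in a few lines. This is an illustrative sketch, not the scoring code used in any of the cited studies. It assumes both alignments contain the same sequences, given as a dictionary from sequence name to gapped string with "-" as the gap character. It reports fractions rather than percentages, and it counts only columns with at least two residues toward TC. The function names are hypothetical.

```python
from itertools import combinations


def columns(alignment, gap="-"):
    """Return each column as a frozenset of (sequence name, residue index)."""
    names = sorted(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all rows of an alignment must have the same length")
    next_index = {name: 0 for name in names}
    cols = []
    for c in range(length):
        col = set()
        for name in names:
            if alignment[name][c] != gap:
                col.add((name, next_index[name]))
                next_index[name] += 1
        cols.append(frozenset(col))
    return cols


def homologies(cols):
    """Return every pair of residues that share a column."""
    pairs = set()
    for col in cols:
        pairs.update(combinations(sorted(col), 2))
    return pairs


def alignment_scores(true_aln, est_aln):
    """Compare an estimated alignment with the true (or reference) alignment."""
    if sorted(true_aln) != sorted(est_aln):
        raise ValueError("both alignments must contain the same sequences")
    true_cols, est_cols = columns(true_aln), columns(est_aln)
    true_pairs, est_pairs = homologies(true_cols), homologies(est_cols)
    shared = len(true_pairs & est_pairs)
    sp_score = shared / len(true_pairs) if true_pairs else 1.0
    sp_fn = 1.0 - sp_score                      # true homologies not recovered
    sp_fp = 1.0 - shared / len(est_pairs) if est_pairs else 0.0
    est_set = set(est_cols)
    # Assumption: a column counts toward TC only if it aligns >= 2 residues.
    tc = sum(1 for col in true_cols if len(col) >= 2 and col in est_set)
    return {
        "SP-score": sp_score,
        "SP-FN": sp_fn,
        "SP-FP": sp_fp,
        "TC": tc,
        "Pairs score": 1.0 - (sp_fn + sp_fp) / 2.0,
    }


if __name__ == "__main__":
    true_aln = {"s1": "AC-GT", "s2": "ACGGT", "s3": "A--GT"}
    est_aln = {"s1": "ACG-T", "s2": "ACGGT", "s3": "A-G-T"}
    for name, value in alignment_scores(true_aln, est_aln).items():
        print(name, round(value, 3))
```

In this toy pair the estimated alignment recovers 8 of the 10 true homologies and 3 of the 4 multi-residue columns, so SP-FN and SP-FP are both 0.2 and TC is 3. Dividing TC by the number of reference columns gives the "fraction of columns correct" form quoted for Prefab.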