A Comprehensive Resource for Helminth Genomics

Total Page:16

File Type:pdf, Size:1020Kb

Load more

G Model MOLBIO-11030; No. of Pages 9 ARTICLE IN PRESS Molecular & Biochemical Parasitology xxx (2016) xxx–xxx Contents lists available at ScienceDirect Molecular & Biochemical Parasitology − WormBase ParaSite a comprehensive resource for helminth genomics a,∗ a b a b Kevin L. Howe , Bruce J. Bolt , Myriam Shafie , Paul Kersey , Matthew Berriman a European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK b Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK a r t i c l e i n f o a b s t r a c t Article history: The number of publicly available parasitic worm genome sequences has increased dramatically in the Received 7 October 2016 past three years, and research interest in helminth functional genomics is now quickly gathering pace in Received in revised form response to the foundation that has been laid by these collective efforts. A systematic approach to the 24 November 2016 organisation, curation, analysis and presentation of these data is clearly vital for maximising the util- Accepted 25 November 2016 ity of these data to researchers. We have developed a portal called WormBase ParaSite (http://parasite. Available online xxx wormbase.org) for interrogating helminth genomes on a large scale. Data from over 100 nematode and platyhelminth species are integrated, adding value by way of systematic and consistent functional annota- Keywords: tion (e.g. protein domains and Gene Ontology terms), gene expression analysis (e.g. alignment of life-stage Genome browser specific transcriptome data sets), and comparative analysis (e.g. orthologues and paralogues). We provide Comparative genomics Functional genomics several ways of exploring the data, including genome browsers, genome and gene summary pages, text Helminths search, sequence search, a query wizard, bulk downloads, and programmatic interfaces. In this review, WormBase we provide an overview of the back-end infrastructure and analysis behind WormBase ParaSite, and the Ensembl displays and tools available to users for interrogating helminth genomic data. © 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). 1. Introduction agricultural importance and therefore attract research interest in their own right. The first parasitic nematode to have its genome The WormBase project (http://www.wormbase.org, [1]) was fully sequenced was Brugia malayi [3], and since then, many have initiated to facilitate and accelerate biological research that uses the followed [4–37]. Of the WormBase “core” species (the ones for model nematode Caenorhabditis elegans, by making the collected which the reference genome sequence and annotation are curated), outputs of scientists accessible from a single resource. This enables three are parasitic worms: Brugia malayi, Onchocerca volvulus and the transfer of this wealth of knowledge to the study of other meta- Strongyloides ratti (http://www.wormbase.org/species). However, zoa, from nematodes to humans. C. elegans data described in the there are a number of challenges associated with expanding the research literature, deposited in the archives, or submitted directly remit to helminths. Firstly, the primary research goal of parasitol- is placed into context via a combination of detailed manual cura- ogists is to identify ways of controlling the parasite, and as such tion and semi-automatic data integration. In addition, the project their desired entry points and common use-cases for WormBase curates the reference genome sequence, gene structures and other are often distinct from those of scientists doing basic science using genomic features for C. elegans, thereby providing a high quality C. elegans as a model. Secondly, the wider community of worm par- foundation for downstream studies. The WormBase mission also asitologists includes those studying platyhelminths (flatworms), extends to free-living relatives of C. elegans, which were the focus which are outside the taxonomic scope of WormBase, which is of early post-genomic research in the areas of evolution and com- a nematode resource. Thirdly, there have been recent concerted parative biology [2]. efforts to sequence the genomes of many helminths (nematodes In recent years, WormBase has begun to expand its mission and platyhelmiths) (e.g. the 50 Helminth Genomes Initiative [38]), to include plant and animal parasitic nematodes which, while resulting in a flood of new draft genomes that vary considerably in more distantly related to C. elegans, have direct biomedical and contiguity. In response to these challenges, we have created WormBase Par- aSite, a comprehensive new resource for parasitic worm genomes, ∗ aiming to serve parasitologists working on helminths who use Corresponding author. E-mail address: [email protected] (K.L. Howe). genomics as an investigative tool. WormBase ParaSite leverages the http://dx.doi.org/10.1016/j.molbiopara.2016.11.005 0166-6851/© 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4. 0/). Please cite this article in press as: K.L. Howe, et al., WormBase ParaSite − a comprehensive resource for helminth genomics, Mol Biochem Parasitol (2016), http://dx.doi.org/10.1016/j.molbiopara.2016.11.005 G Model MOLBIO-11030; No. of Pages 9 ARTICLE IN PRESS 2 K.L. Howe et al. / Molecular & Biochemical Parasitology xxx (2016) xxx–xxx Table 1 infrastructure and expertise of the WormBase project, with a main Species with RNA-Seq expression tracks in WormBase ParaSite (release 7). Included mission of breadth (foundational data for many species) rather than are some unpublished studies that are publicly available via the European depth (rich data for a single species). Nucleotide Archive. Species Studies Samples (tracks) Nematode Brugia malayi 3 [58,59] 57 2. Data integration and analysis Onchocerca volvulus 2 [60,61] 19 Strongyloides ratti 2 [34] 8 2.1. Genomes Strongyloides stercoralis 2 [62] 42 Trichuris muris 1 [25] 22 Our mission is to include all publicly available nematode and Platyhelminth Echinococcus granulosus 2 [15] 6 platyhelminth genomes in WormBase ParaSite. Where multi- Echinococcus multicularis 2 [15] 35 Fasciola hepatica 2 [31] 17 ple genome assemblies exist for the same species (e.g. different Hymenolepis microstoma 2 [15] 17 genome projects for Haemonchus contortus [18,19], and male and Schistosoma mansoni 6 [63–67] 33 female isolates for Trichuris suis [24]), we have included all. Release 7 of the resource (August 2016) included genomes from 82 nema- tode species (98 genomes) and 28 platyhelminth species (30 2.2.2. Functional annotation genomes). We maintain an up-to-date list of genomes for ease of Literature describing gene function is sparse for most of the reference (http://parasite.wormbase.org/species.html). species in WormBase ParaSite, although we anticipate that the Our primary source for genome sequences is the International availability of the genomes will stimulate new research. In lieu of Nucleotide Sequence Database Collaboration (INSDC) resources annotations based on experimental evidence, we have used estab- [39]. In rare cases, we collect genomes from project-specific FTP lished automated methods to predict the function of as many gene sites or direct engagement with genome project scientists. How- products as possible. Firstly, we broker the submission of all protein ever, it is our strong preference to use genomes deposited with sequences in WormBase ParaSite to the UniProt Knowledgebase INSDC as these have been formally checked, processed and ver- [51], and import the product names assigned by them. These are sioned by an authoritative sequence archive resource. As well as defined by a combination of manual curation and automatic anno- allowing us to formally disambiguate between different genomes tation. For more detailed annotation, we use InterProScan [52] from for the same species, by way of the INSDC BioProject identifier the InterPro project [53]. As well as predicting protein domains (e.g. (http://www.ncbi.nlm.gov/bioproject), it also simplifies the pro- from the Pfam [54] database), it also assigns terms from the Gene cess of integrating and displaying functional genomics data that has Ontology (GO) [55]. We run the latest version of the pipeline and been submitted to the archives. For the species in common between data across the complete proteome for every genome each release, WormBase and WormBase ParaSite (C. elegans and other free-living so that we are always up-to-date with the latest InterPro domain nematodes, Brugia malayi, Strongyloides ratti and Onchocerca volvu- and Gene Ontology annotations. lus), we synchronise the data with a specific release of WormBase. For example, WormBase ParaSite release 7 was synchronised with 2.2.3. Gene expression WormBase release WS254, meaning that data for species in com- High-throughput sequencing of RNA has become the standard mon were identical in both resources. assay for measuring gene expression, and numerous studies con- ducting “RNA-Seq” experiments in helminth species have now been performed and deposited in the sequence archives. We ultimately 2.2. Genome annotation aim to align all helminth RNA-Seq data to the corresponding refer- ence genome using a standard pipeline, in collaboration with the 2.2.1. Gene structures Gene Expression Atlas project [56] (see Section 4). In the meantime, The definition of the intron/exon structure of protein-coding we have processed data for a small number of species using our own genes is an important foundation for interpretation of genome pipeline, as a pilot study (Table 1). Briefly, we use STAR [57] to align function. WormBase curates gene structures for a C. elegans and a each experiment to the reference genome, merging resulting BAM small set of nematode species with high-quality reference genomes files when they are technical replicates of the same sample. We [40]. For others, we import annotations from the acknowledged have labelled each sample with the appropriate descriptive terms, authority for that genome (e.g.
Recommended publications
  • Original Article Text Mining in the Biocuration Workflow: Applications for Literature Curation at Wormbase, Dictybase and TAIR

    Original Article Text Mining in the Biocuration Workflow: Applications for Literature Curation at Wormbase, Dictybase and TAIR

    Database, Vol. 2012, Article ID bas040, doi:10.1093/database/bas040 ............................................................................................................................................................................................................................................................................................. Original article Text mining in the biocuration workflow: applications for literature curation at WormBase, dictyBase and TAIR Kimberly Van Auken1,*, Petra Fey2, Tanya Z. Berardini3, Robert Dodson2, Laurel Cooper4, Donghui Li3, Juancarlos Chan1, Yuling Li1, Siddhartha Basu2, Hans-Michael Muller1, Downloaded from Rex Chisholm2, Eva Huala3, Paul W. Sternberg1,5 and the WormBase Consortium 1Division of Biology, California Institute of Technology, 1200 E. California Boulevard, Pasadena, CA 91125, 2Northwestern University Biomedical Informatics Center and Center for Genetic Medicine, 420 E. Superior Street, Chicago, IL 60611, 3Department of Plant Biology, Carnegie Institution, 260 Panama Street, Stanford, CA 94305, 4Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR 97331 and 5Howard Hughes Medical Institute, California Institute of Technology, 1200 E. California Boulevard, Pasadena, CA 91125, USA http://database.oxfordjournals.org/ *Corresponding author: Tel: +1 609 937 1635; Fax: +1 626 568 8012; Email: [email protected] Submitted 18 June 2012; Revised 30 September 2012; Accepted 2 October 2012 ............................................................................................................................................................................................................................................................................................
  • The ELIXIR Core Data Resources: ​Fundamental Infrastructure for The

    The ELIXIR Core Data Resources: ​Fundamental Infrastructure for The

    Supplementary Data: The ELIXIR Core Data Resources: fundamental infrastructure ​ for the life sciences The “Supporting Material” referred to within this Supplementary Data can be found in the Supporting.Material.CDR.infrastructure file, DOI: 10.5281/zenodo.2625247 (https://zenodo.org/record/2625247). ​ ​ Figure 1. Scale of the Core Data Resources Table S1. Data from which Figure 1 is derived: Year 2013 2014 2015 2016 2017 Data entries 765881651 997794559 1726529931 1853429002 2715599247 Monthly user/IP addresses 1700660 2109586 2413724 2502617 2867265 FTEs 270 292.65 295.65 289.7 311.2 Figure 1 includes data from the following Core Data Resources: ArrayExpress, BRENDA, CATH, ChEBI, ChEMBL, EGA, ENA, Ensembl, Ensembl Genomes, EuropePMC, HPA, IntAct /MINT , InterPro, PDBe, PRIDE, SILVA, STRING, UniProt ● Note that Ensembl’s compute infrastructure physically relocated in 2016, so “Users/IP address” data are not available for that year. In this case, the 2015 numbers were rolled forward to 2016. ● Note that STRING makes only minor releases in 2014 and 2016, in that the interactions are re-computed, but the number of “Data entries” remains unchanged. The major releases that change the number of “Data entries” happened in 2013 and 2015. So, for “Data entries” , the number for 2013 was rolled forward to 2014, and the number for 2015 was rolled forward to 2016. The ELIXIR Core Data Resources: fundamental infrastructure for the life sciences ​ 1 Figure 2: Usage of Core Data Resources in research The following steps were taken: 1. API calls were run on open access full text articles in Europe PMC to identify articles that ​ ​ mention Core Data Resource by name or include specific data record accession numbers.
  • Bioinformatics Study of Lectins: New Classification and Prediction In

    Bioinformatics Study of Lectins: New Classification and Prediction In

    Bioinformatics study of lectins : new classification and prediction in genomes François Bonnardel To cite this version: François Bonnardel. Bioinformatics study of lectins : new classification and prediction in genomes. Structural Biology [q-bio.BM]. Université Grenoble Alpes [2020-..]; Université de Genève, 2021. En- glish. NNT : 2021GRALV010. tel-03331649 HAL Id: tel-03331649 https://tel.archives-ouvertes.fr/tel-03331649 Submitted on 2 Sep 2021 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. THÈSE Pour obtenir le grade de DOCTEUR DE L’UNIVERSITE GRENOBLE ALPES préparée dans le cadre d’une cotutelle entre la Communauté Université Grenoble Alpes et l’Université de Genève Spécialités: Chimie Biologie Arrêté ministériel : le 6 janvier 2005 – 25 mai 2016 Présentée par François Bonnardel Thèse dirigée par la Dr. Anne Imberty codirigée par la Dr/Prof. Frédérique Lisacek préparée au sein du laboratoire CERMAV, CNRS et du Computer Science Department, UNIGE et de l’équipe PIG, SIB Dans les Écoles Doctorales EDCSV et UNIGE Etude bioinformatique des lectines: nouvelle classification et prédiction dans les génomes Thèse soutenue publiquement le 8 Février 2021, devant le jury composé de : Dr. Alexandre de Brevern UMR S1134, Inserm, Université Paris Diderot, Paris, France, Rapporteur Dr.
  • Cryptic Inoviruses Revealed As Pervasive in Bacteria and Archaea Across Earth’S Biomes

    Cryptic Inoviruses Revealed As Pervasive in Bacteria and Archaea Across Earth’S Biomes

    ARTICLES https://doi.org/10.1038/s41564-019-0510-x Corrected: Author Correction Cryptic inoviruses revealed as pervasive in bacteria and archaea across Earth’s biomes Simon Roux 1*, Mart Krupovic 2, Rebecca A. Daly3, Adair L. Borges4, Stephen Nayfach1, Frederik Schulz 1, Allison Sharrar5, Paula B. Matheus Carnevali 5, Jan-Fang Cheng1, Natalia N. Ivanova 1, Joseph Bondy-Denomy4,6, Kelly C. Wrighton3, Tanja Woyke 1, Axel Visel 1, Nikos C. Kyrpides1 and Emiley A. Eloe-Fadrosh 1* Bacteriophages from the Inoviridae family (inoviruses) are characterized by their unique morphology, genome content and infection cycle. One of the most striking features of inoviruses is their ability to establish a chronic infection whereby the viral genome resides within the cell in either an exclusively episomal state or integrated into the host chromosome and virions are continuously released without killing the host. To date, a relatively small number of inovirus isolates have been extensively studied, either for biotechnological applications, such as phage display, or because of their effect on the toxicity of known bacterial pathogens including Vibrio cholerae and Neisseria meningitidis. Here, we show that the current 56 members of the Inoviridae family represent a minute fraction of a highly diverse group of inoviruses. Using a machine learning approach lever- aging a combination of marker gene and genome features, we identified 10,295 inovirus-like sequences from microbial genomes and metagenomes. Collectively, our results call for reclassification of the current Inoviridae family into a viral order including six distinct proposed families associated with nearly all bacterial phyla across virtually every ecosystem.
  • Learning Protein Constitutive Motifs from Sequence Data Je´ Roˆ Me Tubiana, Simona Cocco, Re´ Mi Monasson*

    Learning Protein Constitutive Motifs from Sequence Data Je´ Roˆ Me Tubiana, Simona Cocco, Re´ Mi Monasson*

    TOOLS AND RESOURCES Learning protein constitutive motifs from sequence data Je´ roˆ me Tubiana, Simona Cocco, Re´ mi Monasson* Laboratory of Physics of the Ecole Normale Supe´rieure, CNRS UMR 8023 & PSL Research, Paris, France Abstract Statistical analysis of evolutionary-related protein sequences provides information about their structure, function, and history. We show that Restricted Boltzmann Machines (RBM), designed to learn complex high-dimensional data and their statistical features, can efficiently model protein families from sequence information. We here apply RBM to 20 protein families, and present detailed results for two short protein domains (Kunitz and WW), one long chaperone protein (Hsp70), and synthetic lattice proteins for benchmarking. The features inferred by the RBM are biologically interpretable: they are related to structure (residue-residue tertiary contacts, extended secondary motifs (a-helixes and b-sheets) and intrinsically disordered regions), to function (activity and ligand specificity), or to phylogenetic identity. In addition, we use RBM to design new protein sequences with putative properties by composing and ’turning up’ or ’turning down’ the different modes at will. Our work therefore shows that RBM are versatile and practical tools that can be used to unveil and exploit the genotype–phenotype relationship for protein families. DOI: https://doi.org/10.7554/eLife.39397.001 Introduction In recent years, the sequencing of many organisms’ genomes has led to the collection of a huge number of protein sequences, which are catalogued in databases such as UniProt or PFAM Finn et al., 2014). Sequences that share a common ancestral origin, defining a family (Figure 1A), *For correspondence: are likely to code for proteins with similar functions and structures, providing a unique window into [email protected] the relationship between genotype (sequence content) and phenotype (biological features).
  • DECIPHER: Harnessing Local Sequence Context to Improve Protein Multiple Sequence Alignment Erik S

    DECIPHER: Harnessing Local Sequence Context to Improve Protein Multiple Sequence Alignment Erik S

    Wright BMC Bioinformatics (2015) 16:322 DOI 10.1186/s12859-015-0749-z RESEARCH ARTICLE Open Access DECIPHER: harnessing local sequence context to improve protein multiple sequence alignment Erik S. Wright1,2 Abstract Background: Alignment of large and diverse sequence sets is a common task in biological investigations, yet there remains considerable room for improvement in alignment quality. Multiple sequence alignment programs tend to reach maximal accuracy when aligning only a few sequences, and then diminish steadily as more sequences are added. This drop in accuracy can be partly attributed to a build-up of error and ambiguity as more sequences are aligned. Most high-throughput sequence alignment algorithms do not use contextual information under the assumption that sites are independent. This study examines the extent to which local sequence context can be exploited to improve the quality of large multiple sequence alignments. Results: Two predictors based on local sequence context were assessed: (i) single sequence secondary structure predictions, and (ii) modulation of gap costs according to the surrounding residues. The results indicate that context-based predictors have appreciable information content that can be utilized to create more accurate alignments. Furthermore, local context becomes more informative as the number of sequences increases, enabling more accurate protein alignments of large empirical benchmarks. These discoveries became the basis for DECIPHER, a new context-aware program for sequence alignment, which outperformed other programs on largesequencesets. Conclusions: Predicting secondary structure based on local sequence context is an efficient means of breaking the independence assumption in alignment. Since secondary structure is more conserved than primary sequence, it can be leveraged to improve the alignment of distantly related proteins.
  • UC Davis UC Davis Previously Published Works

    UC Davis UC Davis Previously Published Works

    UC Davis UC Davis Previously Published Works Title Longer first introns are a general property of eukaryotic gene structure. Permalink https://escholarship.org/uc/item/9j42z8fm Journal PloS one, 3(8) ISSN 1932-6203 Authors Bradnam, Keith R Korf, Ian Publication Date 2008-08-29 DOI 10.1371/journal.pone.0003093 Peer reviewed eScholarship.org Powered by the California Digital Library University of California Longer First Introns Are a General Property of Eukaryotic Gene Structure Keith R. Bradnam*, Ian Korf Genome Center, University of California Davis, Davis, California, United States of America Abstract While many properties of eukaryotic gene structure are well characterized, differences in the form and function of introns that occur at different positions within a transcript are less well understood. In particular, the dynamics of intron length variation with respect to intron position has received relatively little attention. This study analyzes all available data on intron lengths in GenBank and finds a significant trend of increased length in first introns throughout a wide range of species. This trend was found to be even stronger when using high-confidence gene annotation data for three model organisms (Arabidopsis thaliana, Caenorhabditis elegans, and Drosophila melanogaster) which show that the first intron in the 59 UTR is - on average - significantly longer than all downstream introns within a gene. A partial explanation for increased first intron length in A. thaliana is suggested by the increased frequency of certain motifs that are present in first introns. The phenomenon of longer first introns can potentially be used to improve gene prediction software and also to detect errors in existing gene annotations.
  • Methods in and Applications of the Sequencing of Short Non-Coding Rnas" (2013)

    Methods in and Applications of the Sequencing of Short Non-Coding Rnas" (2013)

    University of Pennsylvania ScholarlyCommons Publicly Accessible Penn Dissertations 2013 Methods in and Applications of the Sequencing of Short Non- Coding RNAs Paul Ryvkin University of Pennsylvania, [email protected] Follow this and additional works at: https://repository.upenn.edu/edissertations Part of the Bioinformatics Commons, Genetics Commons, and the Molecular Biology Commons Recommended Citation Ryvkin, Paul, "Methods in and Applications of the Sequencing of Short Non-Coding RNAs" (2013). Publicly Accessible Penn Dissertations. 922. https://repository.upenn.edu/edissertations/922 This paper is posted at ScholarlyCommons. https://repository.upenn.edu/edissertations/922 For more information, please contact [email protected]. Methods in and Applications of the Sequencing of Short Non-Coding RNAs Abstract Short non-coding RNAs are important for all domains of life. With the advent of modern molecular biology their applicability to medicine has become apparent in settings ranging from diagonistic biomarkers to therapeutics and fields angingr from oncology to neurology. In addition, a critical, recent technological development is high-throughput sequencing of nucleic acids. The convergence of modern biotechnology with developments in RNA biology presents opportunities in both basic research and medical settings. Here I present two novel methods for leveraging high-throughput sequencing in the study of short non- coding RNAs, as well as a study in which they are applied to Alzheimer's Disease (AD). The computational methods presented here include High-throughput Annotation of Modified Ribonucleotides (HAMR), which enables researchers to detect post-transcriptional covalent modifications ot RNAs in a high-throughput manner. In addition, I describe Classification of RNAs by Analysis of Length (CoRAL), a computational method that allows researchers to characterize the pathways responsible for short non-coding RNA biogenesis.
  • A SARS-Cov-2 Sequence Submission Tool for the European Nucleotide

    A SARS-Cov-2 Sequence Submission Tool for the European Nucleotide

    Databases and ontologies Downloaded from https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab421/6294398 by guest on 25 June 2021 A SARS-CoV-2 sequence submission tool for the European Nucleotide Archive Miguel Roncoroni 1,2,∗, Bert Droesbeke 1,2, Ignacio Eguinoa 1,2, Kim De Ruyck 1,2, Flora D’Anna 1,2, Dilmurat Yusuf 3, Björn Grüning 3, Rolf Backofen 3 and Frederik Coppens 1,2 1Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Ghent, Belgium, 1VIB Center for Plant Systems Biology, 9052 Ghent, Belgium and 2University of Freiburg, Department of Computer Science, Freiburg im Breisgau, Baden-Württemberg, Germany ∗To whom correspondence should be addressed. Associate Editor: XXXXXXX Received on XXXXX; revised on XXXXX; accepted on XXXXX Abstract Summary: Many aspects of the global response to the COVID-19 pandemic are enabled by the fast and open publication of SARS-CoV-2 genetic sequence data. The European Nucleotide Archive (ENA) is the European recommended open repository for genetic sequences. In this work, we present a tool for submitting raw sequencing reads of SARS-CoV-2 to ENA. The tool features a single-step submission process, a graphical user interface, tabular-formatted metadata and the possibility to remove human reads prior to submission. A Galaxy wrap of the tool allows users with little or no bioinformatic knowledge to do bulk sequencing read submissions. The tool is also packed in a Docker container to ease deployment. Availability: CLI ENA upload tool is available at github.com/usegalaxy- eu/ena-upload-cli (DOI 10.5281/zenodo.4537621); Galaxy ENA upload tool at toolshed.g2.bx.psu.edu/view/iuc/ena_upload/382518f24d6d and https://github.com/galaxyproject/tools- iuc/tree/master/tools/ena_upload (development) and; ENA upload Galaxy container at github.com/ELIXIR- Belgium/ena-upload-container (DOI 10.5281/zenodo.4730785) Contact: [email protected] 1 Introduction Nucleotide Archive (ENA).
  • Comparing Tools for Non-Coding RNA Multiple Sequence Alignment Based On

    Comparing Tools for Non-Coding RNA Multiple Sequence Alignment Based On

    Downloaded from rnajournal.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press ES Wright 1 1 TITLE 2 RNAconTest: Comparing tools for non-coding RNA multiple sequence alignment based on 3 structural consistency 4 Running title: RNAconTest: benchmarking comparative RNA programs 5 Author: Erik S. Wright1,* 6 1 Department of Biomedical Informatics, University of Pittsburgh (Pittsburgh, PA) 7 * Corresponding author: Erik S. Wright ([email protected]) 8 Keywords: Multiple sequence alignment, Secondary structure prediction, Benchmark, non- 9 coding RNA, Consensus secondary structure 10 Downloaded from rnajournal.cshlp.org on September 26, 2021 - Published by Cold Spring Harbor Laboratory Press ES Wright 2 11 ABSTRACT 12 The importance of non-coding RNA sequences has become increasingly clear over the past 13 decade. New RNA families are often detected and analyzed using comparative methods based on 14 multiple sequence alignments. Accordingly, a number of programs have been developed for 15 aligning and deriving secondary structures from sets of RNA sequences. Yet, the best tools for 16 these tasks remain unclear because existing benchmarks contain too few sequences belonging to 17 only a small number of RNA families. RNAconTest (RNA consistency test) is a new 18 benchmarking approach relying on the observation that secondary structure is often conserved 19 across highly divergent RNA sequences from the same family. RNAconTest scores multiple 20 sequence alignments based on the level of consistency among known secondary structures 21 belonging to reference sequences in their output alignment. Similarly, consensus secondary 22 structure predictions are scored according to their agreement with one or more known structures 23 in a family.
  • Six-Fold Speed-Up of Smith-Waterman Sequence Database Searches Using Parallel Processing on Common Microprocessors

    Six-Fold Speed-Up of Smith-Waterman Sequence Database Searches Using Parallel Processing on Common Microprocessors

    Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors Running head: Six-fold speed-up of Smith-Waterman searches Torbjørn Rognes* and Erling Seeberg Institute of Medical Microbiology, University of Oslo, The National Hospital, NO-0027 Oslo, Norway Abstract Motivation: Sequence database searching is among the most important and challenging tasks in bioinformatics. The ultimate choice of sequence search algorithm is that of Smith- Waterman. However, because of the computationally demanding nature of this method, heuristic programs or special-purpose hardware alternatives have been developed. Increased speed has been obtained at the cost of reduced sensitivity or very expensive hardware. Results: A fast implementation of the Smith-Waterman sequence alignment algorithm using SIMD (Single-Instruction, Multiple-Data) technology is presented. This implementation is based on the MMX (MultiMedia eXtensions) and SSE (Streaming SIMD Extensions) technology that is embedded in Intel’s latest microprocessors. Similar technology exists also in other modern microprocessors. Six-fold speed-up relative to the fastest previously known Smith-Waterman implementation on the same hardware was achieved by an optimised 8-way parallel processing approach. A speed of more than 150 million cell updates per second was obtained on a single Intel Pentium III 500MHz microprocessor. This is probably the fastest implementation of this algorithm on a single general-purpose microprocessor described to date. Availability: Online searches with the software are available at http://dna.uio.no/search/ Contact: [email protected] Published in Bioinformatics (2000) 16 (8), 699-706. Copyright © (2000) Oxford University Press. *) To whom correspondence should be addressed.
  • 1 Codon-Level Information Improves Predictions of Inter-Residue Contacts in Proteins 2 by Correlated Mutation Analysis 3

    1 Codon-Level Information Improves Predictions of Inter-Residue Contacts in Proteins 2 by Correlated Mutation Analysis 3

    1 Codon-level information improves predictions of inter-residue contacts in proteins 2 by correlated mutation analysis 3 4 5 6 7 8 Etai Jacob1,2, Ron Unger1,* and Amnon Horovitz2,* 9 10 11 1The Mina & Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat- 12 Gan, 52900, 2Department of Structural Biology 13 Weizmann Institute of Science, Rehovot 7610001, Israel 14 15 16 *To whom correspondence should be addressed: 17 Amnon Horovitz ([email protected]) 18 Ron Unger ([email protected]) 19 1 20 Abstract 21 Methods for analysing correlated mutations in proteins are becoming an increasingly 22 powerful tool for predicting contacts within and between proteins. Nevertheless, 23 limitations remain due to the requirement for large multiple sequence alignments (MSA) 24 and the fact that, in general, only the relatively small number of top-ranking predictions 25 are reliable. To date, methods for analysing correlated mutations have relied exclusively 26 on amino acid MSAs as inputs. Here, we describe a new approach for analysing 27 correlated mutations that is based on combined analysis of amino acid and codon MSAs. 28 We show that a direct contact is more likely to be present when the correlation between 29 the positions is strong at the amino acid level but weak at the codon level. The 30 performance of different methods for analysing correlated mutations in predicting 31 contacts is shown to be enhanced significantly when amino acid and codon data are 32 combined. 33 2 34 The effects of mutations that disrupt protein structure and/or function at one site are often 35 suppressed by mutations that occur at other sites either in the same protein or in other 36 proteins.