HMMER User's Guide


Biological sequence analysis using profile hidden Markov models

Sean R. Eddy and the HMMER development team
http://hmmer.org
Version 3.3.2; Nov 2020

Copyright (C) 2020 Howard Hughes Medical Institute. HMMER and its documentation are freely distributed under the 3-Clause BSD open source license. For a copy of the license, see opensource.org/licenses/BSD-3-Clause.

HMMER development is supported in part by the National Human Genome Research Institute of the US National Institutes of Health under grant number R01HG009116. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Contents

Introduction
    How to avoid reading this manual
    Background and brief history
    Problems HMMER is designed for
    HMMER uses ensemble algorithms, not optimal alignment
    Assumptions and limitations of profile HMMs
    How to learn more
    How to cite HMMER
    How to report a bug
    When’s HMMER4 coming?
    What’s still missing
    How to avoid using this software (links to similar software)
Installation
    Quickest: install a precompiled binary package
    Quick-ish: compile the source code
    Geeky: compile source from our github repository
    Gory details
        System requirements
        Multicore parallelization is default
        MPI cluster parallelization is optional
        Using build directories
        Makefile targets
        Compiling the user guide
        What gets installed by make install, and where?
        Installing both HMMER2 and HMMER3
        Seeing more output from make
        Staged installations in a buildroot, for a packaging system
        Workarounds for unusual configure/compilation problems
Tutorial
    Tap, tap; is this thing on?
    The programs in HMMER
    Running a HMMER program
    Files used in the tutorial
    On sequence file formats, briefly
    Searching a sequence database with a profile
        Step 1: build a profile with hmmbuild
        Step 2: search the sequence database with hmmsearch
    Single sequence protein queries using phmmer
    Iterative protein searches using jackhmmer
    Searching a profile database with a query sequence
        Step 1: create a profile database file
        Step 2: compress and index the flatfile with hmmpress
        Step 3: search the profile database with hmmscan
        Summary statistics for a profile database: hmmstat
    Creating multiple alignments with hmmalign
    Searching DNA sequences
        Step 1: build a profile with hmmbuild
        Step 2: search the DNA sequence database with nhmmer
The HMMER profile/sequence comparison pipeline
    Null model
    MSV filter
    Biased composition filter
    Viterbi filter
    Forward filter/parser
    Domain definition
    Modifications to the pipeline as used for DNA search
        SSV, not MSV
        There are no domains, but there are envelopes
        Biased composition
Tabular output formats
    The target hits table
    The domain hits table (protein search only)
Manual pages for HMMER programs
    alimask - calculate and add column mask to a multiple sequence alignment
    hmmalign - align sequences to a profile
    hmmbuild - construct profiles from multiple sequence alignments
    hmmc2 - example client for the HMMER daemon
    hmmconvert - convert profile file to various formats
    hmmemit - sample sequences from a profile
    hmmfetch - retrieve profiles from a file
    hmmlogo - produce a conservation logo graphic from a profile
    hmmpgmd - daemon for database search web services
    hmmpgmd_shard - sharded daemon for database search web services
    hmmpress - prepare a profile database for hmmscan
    hmmscan - search sequence(s) against a profile database
    hmmsearch - search profile(s) against a sequence database
    hmmsim - collect profile score distributions on random sequences
    hmmstat - summary statistics for a profile file
    jackhmmer - iteratively search sequence(s) against a sequence database
    makehmmerdb - build nhmmer database from a sequence file
    nhmmer - search DNA queries against a DNA sequence database
    nhmmscan - search DNA sequence(s) against a DNA profile database
    phmmer - search protein sequence(s) against a protein sequence database
Manual pages for Easel miniapps
    esl-afetch - retrieve alignments from a multi-MSA database
    esl-alimanip - manipulate a multiple sequence alignment
    esl-alimap - map two alignments to each other
    esl-alimask - remove columns from a multiple sequence alignment
    esl-alimerge - merge alignments based on their reference (RF) annotation
    esl-alipid - calculate pairwise percent identities for all sequence
    esl-alirev - reverse complement a multiple alignment
    esl-alistat - summarize a multiple sequence alignment file
    esl-compalign - compare two multiple sequence alignments
    esl-compstruct - calculate accuracy of RNA secondary structure predictions
    esl-construct - describe or create a consensus secondary structure
    esl-histplot - collate data histogram, output xmgrace datafile
    esl-mask - mask sequence residues with X’s (or other characters)
    esl-mixdchlet - fitting mixture Dirichlets to count data
    esl-reformat - convert sequence file formats
    esl-selectn - select random subset of lines from file
    esl-seqrange - determine a range of sequences for one of many parallel
    esl-seqstat - summarize contents of a sequence file
    esl-sfetch - retrieve (sub-)sequences from a sequence file
    esl-shuffle - shuffling sequences or generating random ones
    esl-ssdraw - create postscript secondary structure diagrams
    esl-translate - translate DNA sequence in six frames into individual
    esl-weight - calculate sequence weights in MSA(s)
Input files and formats
    Reading from files, compressed files, and pipes
        .gz compressed files
    HMMER profile HMM files
        header section
        main model section
    Stockholm, the recommended multiple sequence alignment format
        syntax of Stockholm markup
        semantics of Stockholm markup
        recognized #=GF annotations
        recognized #=GS annotations
        recognized #=GC annotations
        recognized #=GR annotations
    A2M multiple alignment format
        An example A2M file
        Legal characters
        Determining consensus columns
    hmmpgmd sequence database format
        Fields in header line
        FASTA-like sequence format
        Creating a file in hmmpgmd format
    Score matrix files
Acknowledgements and history

Introduction

Most protein sequences are composed from a relatively small number of ancestral protein domain families. Our sampling of common protein domain families has become comprehensive and deep, while raw sequence data continues to accumulate explosively. It has become advantageous to compare sequences against all known domain families, instead of all known sequences.

This makes protein sequence analysis more like speech recognition. When you talk to your smartphone, it doesn’t compare your digitized speech to everything that’s ever been said. It compares what you say to a prebuilt dataset of statistical models of common words and phonemes. Using machine learning techniques,