The Universal Protein Resource (Uniprot) Amos Bairoch, Rolf Apweiler1,*, Cathy H

Total Page:16

File Type:pdf, Size:1020Kb

The Universal Protein Resource (Uniprot) Amos Bairoch, Rolf Apweiler1,*, Cathy H D154–D159 Nucleic Acids Research, 2005, Vol. 33, Database issue doi:10.1093/nar/gki070 The Universal Protein Resource (UniProt) Amos Bairoch, Rolf Apweiler1,*, Cathy H. Wu2, Winona C. Barker3, Brigitte Boeckmann, Serenella Ferro, Elisabeth Gasteiger, Hongzhan Huang2, Rodrigo Lopez1, Michele Magrane1, Maria J. Martin1, Darren A. Natale2, Claire O’Donovan1, Nicole Redaschi and Lai-Su L. Yeh3 Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 rue Michel Servet, 1211 Geneva 4, Switzerland, 1The EMBL Outstation—The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK, 2Department of Biochemistry and Molecular Biology and 3National Biomedical Research Foundation, Georgetown University Medical Center, 3900 Reservoir Road, NW, Box 571414, WA 20057-1414, USA Received September 14, 2004; Revised and Accepted October 5, 2004 ABSTRACT residue level mapping of sequences from the The Universal Protein Resource (UniProt) provides Protein Data Bank entries onto corresponding the scientific community with a single, centralized, UniProt entries. For convenient sequence searches authoritative resource for protein sequences and we provide the UniRef non-redundant sequence functional information. Formed by uniting the databases. The comprehensive UniParc database Swiss-Prot, TrEMBL and PIR protein database activ- stores the complete body of publicly available ities, the UniProt consortium produces three layers of protein sequence data. The UniProt databases can protein sequence databases: the UniProt Archive be accessed online (http://www.uniprot.org) or down- (UniParc), the UniProt Knowledgebase (UniProt) and loaded in several formats (ftp://ftp.uniprot.org/pub). the UniProt Reference (UniRef) databases. The New releases are published every two weeks. UniProt Knowledgebase is a comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase with extensive cross- INTRODUCTION references. This centrepiece consists of two Previously, Swiss-Prot + TrEMBL (1) and PIR-PSD (2) coex- sections: UniProt/Swiss-Prot, with fully, manually isted as protein databases with differing sequence coverage curated entries; and UniProt/TrEMBL, enriched with and annotation priorities. In 2002, the Swiss-Prot + TrEMBL automated classification and annotation. During groups at the SIB (Swiss Institute of Bioinformatics) and 2004, tens of thousands of Knowledgebase records EBI (European Bioinformatics Institute) and the PIR (Protein got manually annotated or updated; we introduced a Information Resource) group at Georgetown University new comment line topic: TOXIC DOSE to store infor- Medical Center and National Biomedical Research Founda- tion joined forces as the UniProt consortium (3). mation on the acute toxicity of a toxin; the UniProt The UniProt consortium maintains three database layers: keyword list got augmented by additional keywords; we improved the documentation of the keywords and (i) The UniProt Archive (UniParc) provides a stable, com- are continuously overhauling and standardizing prehensive, non-redundant sequence collection by storing the annotation of post-translational modifications. the complete body of publicly available protein sequence Furthermore, we introduced a new documentation data. (ii) The UniProt Knowledgebase (UniProt) provides the file of the strains and their synonyms. Many new central database of protein sequences with accurate, database cross-references were introduced and we consistent and rich sequence and functional annotation. started to make use of Digital Object Identifiers. (iii) The UniProt Reference (UniRef) databases provide We also achieved in collaboration with the non-redundant data collections based on the UniProt Macromolecular Structure Database group at EBI an Knowledgebase and UniParc in order to obtain complete improved integration with structural databases by coverage of sequence space at several resolutions. *To whom correspondence should be addressed: Tel: +44 0 1223 494435; Fax: +44 0 1223 494468; Email: [email protected] The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use permissions, please contact [email protected]. ª 2005, the authors Nucleic Acids Research, Vol. 33, Database issue ª Oxford University Press 2005; all rights reserved Nucleic Acids Research, 2005, Vol. 33, Database issue D155 THE UniProt ARCHIVE information), we attach other annotation information both manually and automatically. Although most protein sequence data are derived from the Manual annotation is performed by biologists and is based translation of DDBJ/EMBL/GenBank (4) sequences, primary on literature curation and sequence analysis. The annotation protein sequence data are also submitted directly to UniProt principles were described in detail previously (3,12). During or appear in patent applications or in entries from the Protein 2004, tens of thousands of records were manually annotated or Data Bank (PDB) (5). The UniParc (6) is designed to capture updated. We also have introduced a new comment (CC) line all available protein sequence data—not just from the topic: TOXIC DOSE. This topic is used to store information aforementioned databases, but also from sources such as on the poisoning potential (acute toxicity) of a toxin. Generally Ensembl (7), the International Protein Index (IPI) (8), RefSeq this topic holds information on the LD50 and PD50. LD stands (9), FlyBase (10) and WormBase (11). This combination of for ‘Lethal Dose’. LD50 is the amount of a toxin, given all at sources makes UniParc the most comprehensive publicly once, which causes the death of 50% (one-half) of a group of accessible, non-redundant protein sequence database test animals. PD50 stands for ‘Paralytic dose’. It is the amount available. of a toxin, which causes the paralysis of 50% of a group of UniParc represents each protein sequence once and only test animals. once, assigning it a unique UniParc identifier. The UniParc Examples: release 2.6 from September 2004 contained 4 375 775 unique sequences from 11 978 094 original source records. UniParc CC -!- TOXIC DOSE: PD50 is 1.72 mg/kg by injection in cross-references the accession numbers of the source data- blowfly larvae. bases, using flags to indicate the status of the entry in the CC -!- TOXIC DOSE: LD50 is 0.015 mg/kg by intravenous original source database, with ‘active’ indicating that the injection for sarafotoxin-A and sarafotoxin-B, and 0.3 mg/kg entry is still present in the source database and ‘obsolete’ for sarafotoxin-C. indicating that the entry no longer exists in the source data- base. A UniParc sequence version is incremented each time Automatic classification and annotation. Much progress the underlying sequence changes, making it possible to was made during 2004 in our attempt to provide automatic observe sequence changes in all source databases. A sample large-scale functional characterization and annotation, which UniParc report can be found at http://www.uniprot.org/entry/ is generated with limited human interaction. UPI0000000C37. UniParc records carry no annotation, but this information can be found in the UniProt Knowledgebase InterPro classification. We use InterPro (13) to recognize or other underlying databases. domains and to classify all the protein sequences in UniProt into families and superfamilies. InterPro is an integrated resource of protein families, domains and sites that amal- THE UniProt KNOWLEDGEBASE gamates the efforts of the member databases: Pfam (14), PROSITE (15), PRINTS (16), ProDom (17), SMART (18), The UniProt Knowledgebase merges Swiss-Prot, TrEMBL PIRSF (19), Superfamily (20) and TIGRFAMs (21). Approxi- and PIR-PSD to provide a central database of protein mately 80% of all UniProt Knowledgebase records are sequences with annotations and functional information. All classified according to their InterPro domains and familes. suitable PIR-PSD sequences missing from Swiss-Prot + TrEMBL were incorporated into UniProt and bi-directional Automatic functional annotation of UniProt/TrEMBL. For cross-references were created to allow the easy tracking of PIR- automatic annotation, systems for standardized transfer of PSD entries. The transfer into UniProt of references and annotation from well-characterized proteins in the UniProt/ experimentally verified data present in PIR but missing Swiss-Prot to non-annotated UniProt/TrEMBL entries have from Swiss-Prot + TrEMBL is ongoing. been implemented. RuleBase (22) uses a semi-automatic The UniProt Knowledgebase has two parts: a section of approach, while the Spearmint approach is completely auto- fully, manually annotated records resulting from literature mated and is based on decision trees (23). InterPro is then used information extraction and curator-evaluated computational to assign UniProt entries into groups. The annotation shared by analysis, and a section with computationally analysed records the functionally characterized UniProt/Swiss-Prot proteins of awaiting full manual
Recommended publications
  • Uniprot at EMBL-EBI's Role in CTTV
    Barbara P. Palka, Daniel Gonzalez, Edd Turner, Xavier Watkins, Maria J. Martin, Claire O’Donovan European Bioinformatics Institute (EMBL-EBI), European Molecular Biology Laboratory, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK UniProt at EMBL-EBI’s role in CTTV: contributing to improved disease knowledge Introduction The mission of UniProt is to provide the scientific community with a The Centre for Therapeutic Target Validation (CTTV) comprehensive, high quality and freely accessible resource of launched in Dec 2015 a new web platform for life- protein sequence and functional information. science researchers that helps them identify The UniProt Knowledgebase (UniProtKB) is the central hub for the collection of therapeutic targets for new and repurposed medicines. functional information on proteins, with accurate, consistent and rich CTTV is a public-private initiative to generate evidence on the annotation. As much annotation information as possible is added to each validity of therapeutic targets based on genome-scale experiments UniProtKB record and this includes widely accepted biological ontologies, and analysis. CTTV is working to create an R&D framework that classifications and cross-references, and clear indications of the quality of applies to a wide range of human diseases, and is committed to annotation in the form of evidence attribution of experimental and sharing its data openly with the scientific community. CTTV brings computational data. together expertise from four complementary institutions: GSK, Biogen, EMBL-EBI and Wellcome Trust Sanger Institute. UniProt’s disease expert curation Q5VWK5 (IL23R_HUMAN) This section provides information on the disease(s) associated with genetic variations in a given protein. The information is extracted from the scientific literature and diseases that are also described in the OMIM database are represented with a controlled vocabulary.
    [Show full text]
  • Distinct Visual Pathways Mediatedrosophilalarval Light
    The Journal of Neuroscience, April 27, 2011 • 31(17):6527–6534 • 6527 Behavioral/Systems/Cognitive Distinct Visual Pathways Mediate Drosophila Larval Light Avoidance and Circadian Clock Entrainment Alex C. Keene,1 Esteban O. Mazzoni,1 Jamie Zhen,1 Meg A. Younger,1 Satoko Yamaguchi,1 Justin Blau,1 Claude Desplan,1 and Simon G. Sprecher1,2 1Department of Biology, Center for Developmental Genetics, New York University, New York, New York 10003-6688, and 2Department of Biology, Institute of Cell and Developmental Biology, University of Fribourg, 1700 Fribourg, Switzerland Visual organs perceive environmental stimuli required for rapid initiation of behaviors and can also entrain the circadian clock. The larval eye of Drosophila is capable of both functions. Each eye contains only 12 photoreceptors (PRs), which can be subdivided into two subtypes. Four PRs express blue-sensitive rhodopsin5 (rh5) and eight express green-sensitive rhodopsin6 (rh6). We found that either PR-subtype is sufficient to entrain the molecular clock by light, while only the Rh5-PR subtype is essential for light avoidance. Acetylcholine released from PRs confers both functions. Both subtypes of larval PRs innervate the main circadian pacemaker neurons of the larva, the neuropeptide PDF (pigment-dispersing factor)-expressing lateral neurons (LNs), providing sensory input to control circadian rhythms. How- ever, we show that PDF-expressing LNs are dispensable for light avoidance, and a distinct set of three clock neurons is required. Thus we have identifieddistinctsensoryandcentralcircuitryregulatinglightavoidancebehaviorandclockentrainment.Ourfindingsprovideinsightsintothe coding of sensory information for distinct behavioral functions and the underlying molecular and neuronal circuitry. Introduction sin6 (rh6) (Sprecher et al., 2007; Sprecher and Desplan, 2008).
    [Show full text]
  • Article Reference
    Article Incidences of problematic cell lines are lower in papers that use RRIDs to identify cell lines BABIC, Zeljana, et al. Abstract The use of misidentified and contaminated cell lines continues to be a problem in biomedical research. Research Resource Identifiers (RRIDs) should reduce the prevalence of misidentified and contaminated cell lines in the literature by alerting researchers to cell lines that are on the list of problematic cell lines, which is maintained by the International Cell Line Authentication Committee (ICLAC) and the Cellosaurus database. To test this assertion, we text-mined the methods sections of about two million papers in PubMed Central, identifying 305,161 unique cell-line names in 150,459 articles. We estimate that 8.6% of these cell lines were on the list of problematic cell lines, whereas only 3.3% of the cell lines in the 634 papers that included RRIDs were on the problematic list. This suggests that the use of RRIDs is associated with a lower reported use of problematic cell lines. Reference BABIC, Zeljana, et al. Incidences of problematic cell lines are lower in papers that use RRIDs to identify cell lines. eLife, 2019, vol. 8, p. e41676 DOI : 10.7554/eLife.41676 PMID : 30693867 Available at: http://archive-ouverte.unige.ch/unige:119832 Disclaimer: layout of this document may differ from the published version. 1 / 1 FEATURE ARTICLE META-RESEARCH Incidences of problematic cell lines are lower in papers that use RRIDs to identify cell lines Abstract The use of misidentified and contaminated cell lines continues to be a problem in biomedical research.
    [Show full text]
  • Genome Analysis and Knowledge
    Dahary et al. BMC Medical Genomics (2019) 12:200 https://doi.org/10.1186/s12920-019-0647-8 SOFTWARE Open Access Genome analysis and knowledge-driven variant interpretation with TGex Dvir Dahary1*, Yaron Golan1, Yaron Mazor1, Ofer Zelig1, Ruth Barshir2, Michal Twik2, Tsippi Iny Stein2, Guy Rosner3,4, Revital Kariv3,4, Fei Chen5, Qiang Zhang5, Yiping Shen5,6,7, Marilyn Safran2, Doron Lancet2* and Simon Fishilevich2* Abstract Background: The clinical genetics revolution ushers in great opportunities, accompanied by significant challenges. The fundamental mission in clinical genetics is to analyze genomes, and to identify the most relevant genetic variations underlying a patient’s phenotypes and symptoms. The adoption of Whole Genome Sequencing requires novel capacities for interpretation of non-coding variants. Results: We present TGex, the Translational Genomics expert, a novel genome variation analysis and interpretation platform, with remarkable exome analysis capacities and a pioneering approach of non-coding variants interpretation. TGex’s main strength is combining state-of-the-art variant filtering with knowledge-driven analysis made possible by VarElect, our highly effective gene-phenotype interpretation tool. VarElect leverages the widely used GeneCards knowledgebase, which integrates information from > 150 automatically-mined data sources. Access to such a comprehensive data compendium also facilitates TGex’s broad variant annotation, supporting evidence exploration, and decision making. TGex has an interactive, user-friendly, and easy adaptive interface, ACMG compliance, and an automated reporting system. Beyond comprehensive whole exome sequence capabilities, TGex encompasses innovative non-coding variants interpretation, towards the goal of maximal exploitation of whole genome sequence analyses in the clinical genetics practice. This is enabled by GeneCards’ recently developed GeneHancer, a novel integrative and fully annotated database of human enhancers and promoters.
    [Show full text]
  • Bioinformatics Study of Lectins: New Classification and Prediction In
    Bioinformatics study of lectins : new classification and prediction in genomes François Bonnardel To cite this version: François Bonnardel. Bioinformatics study of lectins : new classification and prediction in genomes. Structural Biology [q-bio.BM]. Université Grenoble Alpes [2020-..]; Université de Genève, 2021. En- glish. NNT : 2021GRALV010. tel-03331649 HAL Id: tel-03331649 https://tel.archives-ouvertes.fr/tel-03331649 Submitted on 2 Sep 2021 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. THÈSE Pour obtenir le grade de DOCTEUR DE L’UNIVERSITE GRENOBLE ALPES préparée dans le cadre d’une cotutelle entre la Communauté Université Grenoble Alpes et l’Université de Genève Spécialités: Chimie Biologie Arrêté ministériel : le 6 janvier 2005 – 25 mai 2016 Présentée par François Bonnardel Thèse dirigée par la Dr. Anne Imberty codirigée par la Dr/Prof. Frédérique Lisacek préparée au sein du laboratoire CERMAV, CNRS et du Computer Science Department, UNIGE et de l’équipe PIG, SIB Dans les Écoles Doctorales EDCSV et UNIGE Etude bioinformatique des lectines: nouvelle classification et prédiction dans les génomes Thèse soutenue publiquement le 8 Février 2021, devant le jury composé de : Dr. Alexandre de Brevern UMR S1134, Inserm, Université Paris Diderot, Paris, France, Rapporteur Dr.
    [Show full text]
  • PIR Brochure
    Protein Information Resource Integrated Protein Informatics Resource for Genomic & Proteomic Research For four decades the Protein Information Resource (PIR) has provided databases and protein sequence analysis tools to the scientific community, including the Protein Sequence Database, which grew out from the Atlas of Protein Sequence and Structure, edited by Margaret Dayhoff [1965-1978]. Currently, PIR major activities include: i) UniProt (Universal Protein Resource) development, ii) iProClass protein data integration and ID mapping, iii) PIRSF protein pir.georgetown.edu classification, and iv) iProLINK protein literature mining and ontology development. UniProt – Universal Protein Resource What is UniProt? UniProt is the central resource for storing and UniProt (Universal Protein Resource) http://www.uniprot.org interconnecting information from large and = + + disparate sources and the most UniProt: the world's most comprehensive catalog of information on proteins comprehensive catalog of protein sequence and functional annotation. UniProt Knowledgebase UniProt Reference UniProt Archive (UniProtKB) Clusters (UniRef) (UniParc) When to use UniProt databases Integration of Swiss-Prot, TrEMBL Non-redundant reference A stable, and PIR-PSD sequences clustered from comprehensive Use UniProtKB to retrieve curated, reliable, Fully classified, richly and accurately UniProtKB and UniParc for archive of all publicly annotated protein sequences with comprehensive or fast available protein comprehensive information on proteins. minimal redundancy and extensive sequence searches at 100%, sequences for Use UniRef to decrease redundancy and cross-references 90%, or 50% identity sequence tracking from: speed up sequence similarity searches. TrEMBL section UniRef100 Swiss-Prot, Computer-annotated protein sequences TrEMBL, PIR-PSD, Use UniParc to access to archived sequences EMBL, Ensembl, IPI, and their source databases.
    [Show full text]
  • Metabolic Inactivation of the Circadian Transmitter, Pigment Dispersing Factor (PDF), by Neprilysin-Like Peptidases in Drosophila R
    4465 The Journal of Experimental Biology 210, 4465-4470 Published by The Company of Biologists 2007 doi:10.1242/jeb.012088 Metabolic inactivation of the circadian transmitter, pigment dispersing factor (PDF), by neprilysin-like peptidases in Drosophila R. Elwyn Isaac1,*, Erik C. Johnson2, Neil Audsley3 and Alan D. Shirras4 1Faculty of Biological Sciences, University of Leeds, Leeds, LS2 9JT, UK, 2Department of Biology, Wake Forest University, NC, USA, 3Central Science Laboratory, Sand Hutton, York, YO41 1LZ, UK and 4Department of Biological Sciences, University of Lancaster, LA1 4YQ, UK *Author for correspondence (e-mail: [email protected]) Accepted 4 October 2007 Summary Recent studies have firmly established pigment confirming that such cleavage results in PDF inactivation. dispersing factor (PDF), a C-terminally amidated The Ser7–Leu8 peptide bond was also the principal octodecapeptide, as a key neurotransmitter regulating cleavage site when PDF was incubated with membranes rhythmic circadian locomotory behaviours in adult prepared from heads of adult Drosophila. This Drosophila melanogaster. The mechanisms by which PDF endopeptidase activity was inhibited by the neprilysin –1 functions as a circadian peptide transmitter are not fully inhibitors phosphoramidon (IC50, 0.15·␮mol·l ) and –1 understood, however; in particular, nothing is known thiorphan (IC50, 1.2·␮mol·l ). We propose that cleavage by about the role of extracellular peptidases in terminating a member of the Drosophila neprilysin family of PDF signalling at synapses. In this study we show that PDF endopeptidases is the most likely mechanism for is susceptible to hydrolysis by neprilysin, an endopeptidase inactivating synaptic PDF and that neprilysin might have that is enriched in synaptic membranes of mammals and an important role in regulating PDF signals within insects.
    [Show full text]
  • Loss of BCL-3 Sensitises Colorectal Cancer Cells to DNA Damage, Revealing A
    bioRxiv preprint doi: https://doi.org/10.1101/2021.08.03.454995; this version posted August 4, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. Title: Loss of BCL-3 sensitises colorectal cancer cells to DNA damage, revealing a role for BCL-3 in double strand break repair by homologous recombination Authors: Christopher Parker*1, Adam C Chambers*1, Dustin Flanagan2, Tracey J Collard1, Greg Ngo3, Duncan M Baird3, Penny Timms1, Rhys G Morgan4, Owen Sansom2 and Ann C Williams1. *Joint first authors. Author affiliations: 1. Colorectal Tumour Biology Group, School of Cellular and Molecular Medicine, Faculty of Life Sciences, Biomedical Sciences Building, University Walk, University of Bristol, Bristol, BS8 1TD, UK 2. Cancer Research UK Beatson Institute, Garscube Estate, Switchback Road, Bearsden Glasgow, G61 1BD UK 3. Division of Cancer and Genetics, School of Medicine, Cardiff University, Cardiff, CF14 4XN UK 4. School of Life Sciences, University of Sussex, Sussex House, Falmer, Brighton, BN1 9RH UK 1 bioRxiv preprint doi: https://doi.org/10.1101/2021.08.03.454995; this version posted August 4, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. Abstract (250 words) Objective: The proto-oncogene BCL-3 is upregulated in a subset of colorectal cancers (CRC) and increased expression of the gene correlates with poor patient prognosis. The aim is to investigate whether inhibiting BCL-3 can increase the response to DNA damage in CRC.
    [Show full text]
  • Update on Genome Completion and Annotations
    UPDATE ON GENOME COMPLETION AND ANNOTATIONS Update on genome completion and annotations: Protein Information Resource Cathy Wu1and Daniel W. Nebert2* 1Director of PIR, Department of Biochemistry and Molecular Biology, Georgetown University Medical Center, Washington, DC, USA 2Department of Environmental Health and Center for Environmental Genetics (CEG), University of Cincinnati Medical Center, Cincinnati, OH 45267–0056, USA *Correspondence to: Tel: þ1 513 558 4347; Fax: þ1 513 558 3562; E-mail: [email protected] Date received (in revised form): 11th January 2004 Abstract The Protein Information Resource (PIR) recently joined the European Bioinformatics Institute (EBI) and Swiss Institute of Bioinformatics (SIB) to establish UniProt — the Universal Protein Resource — which now unifies the PIR, Swiss-Prot and TrEMBL databases. The PIRSF (SuperFamily) classification system is central to the PIR/UniProt functional annotation of proteins, providing classifications of whole proteins into a network structure to reflect their evolutionary relationships. Data integration and associative studies of protein family, function and structure are supported by the iProClass database, which offers value-added descriptions of all UniProt proteins with highly informative links to more than 50 other databases. The PIR system allows consistent, rich and accurate protein annotation for all investigators. Keywords: protein web sites, protein family, functional annotation Introduction system.2 This framework is supported by the iProClass integrated database of protein family, function and structure.3 The high-throughput genome projects have resulted in a rapid iProClass offers value-added descriptions of all UniProt accumulation of genome sequences for a large number of proteins and has highly informative links to more than 50 organisms.
    [Show full text]
  • A SARS-Cov-2 Sequence Submission Tool for the European Nucleotide
    Databases and ontologies Downloaded from https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab421/6294398 by guest on 25 June 2021 A SARS-CoV-2 sequence submission tool for the European Nucleotide Archive Miguel Roncoroni 1,2,∗, Bert Droesbeke 1,2, Ignacio Eguinoa 1,2, Kim De Ruyck 1,2, Flora D’Anna 1,2, Dilmurat Yusuf 3, Björn Grüning 3, Rolf Backofen 3 and Frederik Coppens 1,2 1Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Ghent, Belgium, 1VIB Center for Plant Systems Biology, 9052 Ghent, Belgium and 2University of Freiburg, Department of Computer Science, Freiburg im Breisgau, Baden-Württemberg, Germany ∗To whom correspondence should be addressed. Associate Editor: XXXXXXX Received on XXXXX; revised on XXXXX; accepted on XXXXX Abstract Summary: Many aspects of the global response to the COVID-19 pandemic are enabled by the fast and open publication of SARS-CoV-2 genetic sequence data. The European Nucleotide Archive (ENA) is the European recommended open repository for genetic sequences. In this work, we present a tool for submitting raw sequencing reads of SARS-CoV-2 to ENA. The tool features a single-step submission process, a graphical user interface, tabular-formatted metadata and the possibility to remove human reads prior to submission. A Galaxy wrap of the tool allows users with little or no bioinformatic knowledge to do bulk sequencing read submissions. The tool is also packed in a Docker container to ease deployment. Availability: CLI ENA upload tool is available at github.com/usegalaxy- eu/ena-upload-cli (DOI 10.5281/zenodo.4537621); Galaxy ENA upload tool at toolshed.g2.bx.psu.edu/view/iuc/ena_upload/382518f24d6d and https://github.com/galaxyproject/tools- iuc/tree/master/tools/ena_upload (development) and; ENA upload Galaxy container at github.com/ELIXIR- Belgium/ena-upload-container (DOI 10.5281/zenodo.4730785) Contact: [email protected] 1 Introduction Nucleotide Archive (ENA).
    [Show full text]
  • Webnetcoffee
    Hu et al. BMC Bioinformatics (2018) 19:422 https://doi.org/10.1186/s12859-018-2443-4 SOFTWARE Open Access WebNetCoffee: a web-based application to identify functionally conserved proteins from Multiple PPI networks Jialu Hu1,2, Yiqun Gao1, Junhao He1, Yan Zheng1 and Xuequn Shang1* Abstract Background: The discovery of functionally conserved proteins is a tough and important task in system biology. Global network alignment provides a systematic framework to search for these proteins from multiple protein-protein interaction (PPI) networks. Although there exist many web servers for network alignment, no one allows to perform global multiple network alignment tasks on users’ test datasets. Results: Here, we developed a web server WebNetcoffee based on the algorithm of NetCoffee to search for a global network alignment from multiple networks. To build a series of online test datasets, we manually collected 218,339 proteins, 4,009,541 interactions and many other associated protein annotations from several public databases. All these datasets and alignment results are available for download, which can support users to perform algorithm comparison and downstream analyses. Conclusion: WebNetCoffee provides a versatile, interactive and user-friendly interface for easily running alignment tasks on both online datasets and users’ test datasets, managing submitted jobs and visualizing the alignment results through a web browser. Additionally, our web server also facilitates graphical visualization of induced subnetworks for a given protein and its neighborhood. To the best of our knowledge, it is the first web server that facilitates the performing of global alignment for multiple PPI networks. Availability: http://www.nwpu-bioinformatics.com/WebNetCoffee Keywords: Multiple network alignment, Webserver, PPI networks, Protein databases, Gene ontology Background tools [7–10] have been developed to understand molec- Proteins are involved in almost all life processes.
    [Show full text]
  • Six-Fold Speed-Up of Smith-Waterman Sequence Database Searches Using Parallel Processing on Common Microprocessors
    Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors Running head: Six-fold speed-up of Smith-Waterman searches Torbjørn Rognes* and Erling Seeberg Institute of Medical Microbiology, University of Oslo, The National Hospital, NO-0027 Oslo, Norway Abstract Motivation: Sequence database searching is among the most important and challenging tasks in bioinformatics. The ultimate choice of sequence search algorithm is that of Smith- Waterman. However, because of the computationally demanding nature of this method, heuristic programs or special-purpose hardware alternatives have been developed. Increased speed has been obtained at the cost of reduced sensitivity or very expensive hardware. Results: A fast implementation of the Smith-Waterman sequence alignment algorithm using SIMD (Single-Instruction, Multiple-Data) technology is presented. This implementation is based on the MMX (MultiMedia eXtensions) and SSE (Streaming SIMD Extensions) technology that is embedded in Intel’s latest microprocessors. Similar technology exists also in other modern microprocessors. Six-fold speed-up relative to the fastest previously known Smith-Waterman implementation on the same hardware was achieved by an optimised 8-way parallel processing approach. A speed of more than 150 million cell updates per second was obtained on a single Intel Pentium III 500MHz microprocessor. This is probably the fastest implementation of this algorithm on a single general-purpose microprocessor described to date. Availability: Online searches with the software are available at http://dna.uio.no/search/ Contact: [email protected] Published in Bioinformatics (2000) 16 (8), 699-706. Copyright © (2000) Oxford University Press. *) To whom correspondence should be addressed.
    [Show full text]