The European Nucleotide Archive


Nucleic Acids Research, 2011, Vol. 39, Database issue, D28–D31
Published online 23 October 2010
doi:10.1093/nar/gkq967

Rasko Leinonen*, Ruth Akhtar, Ewan Birney, Lawrence Bower, Ana Cerdeno-Tárraga, Ying Cheng, Iain Cleland, Nadeem Faruque, Neil Goodgame, Richard Gibson, Gemma Hoad, Mikyung Jang, Nima Pakseresht, Sheila Plaister, Rajesh Radhakrishnan, Kethi Reddy, Siamak Sobhany, Petra Ten Hoopen, Robert Vaughan, Vadim Zalunin and Guy Cochrane

European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

Received September 15, 2010; Accepted October 3, 2010

*To whom correspondence should be addressed. Tel: +44 1223 494608; Fax: +44 1223 494468; Email: [email protected]

© The Author(s) 2010. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena) is Europe’s primary nucleotide-sequence repository. The ENA consists of three main databases: the Sequence Read Archive (SRA), the Trace Archive and EMBL-Bank. The objective of ENA is to support and promote the use of nucleotide sequencing as an experimental research platform by providing data submission, archive, search and download services. In this article, we outline these services and describe major changes and improvements introduced during 2010. These include extended EMBL-Bank and SRA-data submission services, extended ENA Browser functionality, support for submitting data to the European Genome-phenome Archive (EGA) through SRA, and the launch of a new sequence similarity search service.

THE EUROPEAN NUCLEOTIDE ARCHIVE

The European Nucleotide Archive (ENA) operates as a public archive for nucleotide sequence data. By bringing together databases for raw sequence data, assembly information and functional annotation, the ENA provides a comprehensive and integrated resource for this fundamental source of biological information. Central to the ENA is the provision of submission services, including interactive and programmatic submission tools, search services, including text and sequence similarity search tools and data presentation and retrieval services. The ENA works closely together with NCBI (1) and DDBJ (2) as partners in the International Nucleotide Sequence Database Collaboration (3). The principal policy of INSDC is to provide free and unrestricted permanent access to all archived nucleotide data. All primary data in the INSDC belongs to the submitters and can only be updated with submitter consent. For full policy details please refer to: http://www.insdc.org/policy.html.

CONTENT

In October 2010, the ENA contained 500 billion raw and assembled sequences consisting of 50 trillion base pairs. In the last 3 years, the next-generation sequence reads stored in the Sequence Read Archive (SRA) have become the largest and fastest growing source of new data accounting now for 95% of all base pairs made available by ENA. At the same time, the number of completed genome sequences has risen to over 1400 for cellular organisms and 3000 for viruses and phages (http://www.ebi.ac.uk/genomes/).

SUBMISSIONS OF RAW DATA FROM NEXT GENERATION PLATFORMS

The SRA accepts sequence submissions from next-generation sequencing platforms. New submitters should contact [email protected] for the creation of a submission account and a secure data upload area. Submitters first upload data files into the secure data-upload area in one of the supported data formats, then prepare and submit study, sample, experiment, run and submission XML files to SRA. Detailed submission instructions are available here: http://www.ebi.ac.uk/ena/about/page.php?page=sra_submissions.

We have extended the SRA submission service to support submissions of authorized access data, typically clinical samples that have been sequenced under a confidentiality and consent agreement. Authorized access data can now be submitted through the SRA submission service into the European Genome-phenome Archive (EGA; http://www.ebi.ac.uk/ega). Data submitted to EGA are not part of the public SRA database and are excluded from the INSDC data exchange. Permission to view and retrieve authorised access data can only be granted by the external data access committee (DAC) responsible for the data concerned. Please contact [email protected] for more information about EGA policies. A secure data upload area is required to submit authorised access data through the SRA submission service. It is also possible to submit EGA’s policy, dataset and DAC objects through SRA.

SRA will shortly accept sequence read submissions in Binary Alignment/Map (BAM) format (4). A BAM file is a binary compressed representation of the Sequence Alignment/Map (SAM) format. With sequence read alignments becoming an increasingly common intermediate in primary analysis, BAM format is emerging as a popular choice for storing sequence reads with alignments. The SRA is currently finalizing an archive BAM specification which will standardize the use of BAM files for primary data archival purposes. Once completed, BAM submissions to SRA archives will be required to follow this specification.

SUBMISSIONS OF ASSEMBLED AND ANNOTATED SEQUENCES

EMBL-Bank is a comprehensive public database of nucleotide sequences, associated biological annotation and bibliographic information. It contains a large diversity of data from patent, expressed sequence tag, whole genome shotgun and other high-throughput sequences, through genomic assemblies and richly annotated sequence fragments to whole replicons (5). Submitters should navigate to http://www.ebi.ac.uk/ena/about/page.php?page=submissions for access to all submission services. Advice regarding EMBL-Bank submissions is available from [email protected].

We have extended the web-based EMBL-Bank submission service in a number of ways. For providers of genome-scale data, we have added functionality that allows data submissions in EMBL-Bank flat file format. For smaller scale submissions, we have added new templates to the EMBL-Bank submission service. Each template focuses on a particular commonly occurring type of sequence and annotation data and collects required information from the submitters using a web form or spreadsheet upload. New templates are available for unannotated WGS submissions with only source organism annotation, and for protein coding and phylogenetic-marker regions. The template mechanism, introduced in 2009, has been well received and attracts now up to half of all web-based EMBL-Bank submissions.

DATA SEARCH, BROWSING AND RETRIEVAL

ENA data can be browsed and retrieved in XML, HTML, fasta, fastq and flat file formats using the ENA Browser which can be used both interactively and programmatically through REST URLs. In 2010, we extended the ENA Browser to cover EMBL-Bank and Trace archive records and introduced several improvements including a graphical EMBL-Bank annotation and assembly viewer and intuitive navigation between different ENA data classes. For full details of the ENA browser URL syntax please refer to: http://www.ebi.ac.uk/ena/about/page.php?page=browser. For example, the following URL returns the complete mitochondrial genome for ‘Ursus spelaeus’ (cave bear) (6): http://www.ebi.ac.uk/ena/data/view/FM177760 (Figure 1).

Figure 1. The complete mitochondrial genome for Ursus spelaeus (cave bear) from the Max Planck Institute for Evolutionary Anthropology submitted to EMBL-Bank in 2010.

Data can be queried using the EB-Eye free text search functionality available in the header section of all EBI web pages (7). ENA results are available under the ‘Nucleotide Sequences’ category and linked to the ENA Browser. Free text search is also available from the ENA home page: http://www.ebi.ac.uk/ena.

Rapid and comprehensive sequence similarity searches against ENA data are supported through a new service based on Exonerate (8) technology: http://www.ebi.ac.uk/ena/search/ (Goodgame, N., manuscript in preparation). All nucleotide sequences archived by the INSDC and made available as part of EMBL-Bank are covered by our service. This includes all ENA sequences except raw reads from the Trace Archive and SRA. Experimental search support for a limited number of raw reads is provided through De-Bruijn servers based on Velvet (9), using the Exonerate client-server protocol and being fully integrated with our search service. This search is available by selecting the ‘Experimental De Bruijn search’ option from the search page. The EMBL-Bank sequence search service is currently being expanded for more specific purposes according to community requests.

Bulk download of EMBL-Bank data is supported through FTP at ftp://ftp.ebi.ac.uk/pub/databases/embl/, and SRA and Trace Archive data through FTP at ftp://ftp.sra.ebi.ac.uk/ and Aspera through fasp.sra.ebi.ac.uk.

ENA COMMUNITY

The ENA team welcomes feedback and suggestions relating to all of our services at [email protected]. We are always interested in hearing from potential collaborators who have an interest in working with and integrating our services.

FUNDING

The ENA is funded by the European Molecular Biology Laboratory, European Commission and the Wellcome Trust. Funding for open access charge: European Molecular Biology Laboratory.

Conflict of interest statement. None declared.

REFERENCES

1. Benson,D.A., Karsch-Mizrachi,I., Lipman,D.J., Ostell,J. and Sayers,E.W. (2010) GenBank. Nucleic Acids Res., 38, D46–D51.
2. Kaminuma,E., Mashima,J., Kodama,Y., Gojobori,T., Ogasawara,O., Okubo,K., Takagi,T. and Nakamura,Y. (2010) DDBJ launches a new archive database with analytical tools for next-generation sequence data. Nucleic Acids Res., 38, D33–D38.
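
The article states that the ENA Browser can be used programmatically through REST URLs and gives http://www.ebi.ac.uk/ena/data/view/FM177760 as an example. A minimal Python sketch of that pattern follows. It assumes that a record-view URL is simply the printed base path followed by the accession, which is all the example in the text shows; the helper name `ena_view_url` is hypothetical, no network request is made, and the URL layout is the one printed in 2010, so the live service may differ today.

```python
from urllib.parse import quote

# Base of the record-view URL as printed in the article (2010). This is an
# assumption about the URL layout taken from the single example in the text;
# the live service may have changed since then.
ENA_VIEW_BASE = "http://www.ebi.ac.uk/ena/data/view/"


def ena_view_url(accession: str) -> str:
    """Return the ENA Browser view URL for one accession (hypothetical helper)."""
    accession = accession.strip()
    if not accession:
        raise ValueError("accession must not be empty")
    # Percent-encode any unusual characters so the result is a valid URL path.
    return ENA_VIEW_BASE + quote(accession, safe="")


if __name__ == "__main__":
    # FM177760 is the cave bear mitochondrial genome cited in the text.
    print(ena_view_url("FM177760"))
```

Keeping the base path in one constant makes it easy to adjust if the browser's URL syntax page documents a different layout.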