European Nucleotide Archive in 2016

Total Page:16

File Type:pdf, Size:1020Kb

European Nucleotide Archive in 2016 D32–D36 Nucleic Acids Research, 2017, Vol. 45, Database issue Published online 28 November 2016 doi: 10.1093/nar/gkw1106 European Nucleotide Archive in 2016 Ana Luisa Toribio*, Blaise Alako, Clara Amid, Ana Cerdeno-Tarr˜ aga,´ Laura Clarke, Iain Cleland, Susan Fairley, Richard Gibson, Neil Goodgame, Petra ten Hoopen, Suran Jayathilaka, Simon Kay, Rasko Leinonen, Xin Liu, JosueMart´ ´ınez-Villacorta, Nima Pakseresht, Jeena Rajan, Kethi Reddy, Marc Rosello, Nicole Silvester, Dmitriy Smirnov, Daniel Vaughan, Vadim Zalunin and Guy Cochrane European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK Received October 22, 2016; Revised October 25, 2016; Editorial Decision October 26, 2016; Accepted October 31, 2016 ABSTRACT (1), ENA partners with the National Institute of Genet- ics’ DNA DataBank of Japan (2) and the National Center The European Nucleotide Archive (ENA; http://www. for Biotechnology’s GenBank and Sequence Read archive ebi.ac.uk/ena) offers a rich platform for data sharing, (NCBI GenBank and SRA) (3) to provide globally compre- publishing and archiving and a globally comprehen- hensive coverage through routine data exchange, to build sive data set for onward use by the scientific com- the scientific standards for data exchange and to promote munity. With a broad scope spanning raw sequenc- the timely sharing of well-structured sequence data. ing reads, genome assemblies and functional anno- In this paper, we outline ENA content, introduce the core tation, the resource provides extensive data submis- services on offer from the resource and provide entry points sion, search and download facilities across web and for users into these services. We then outline a number of programmatic interfaces. Here, we outline ENA con- developments that have been made in the last year. Finally, tent and major access modalities, highlight major we highlight examples of use of ENA data. developments in 2016 and outline a number of ex- amples of data reuse from ENA. CONTENT We continue to see significant growth in ENA content, with INTRODUCTION doubling times currently at 32 months for raw sequence For the last third of a century, the European Nucleotide data. At the time of writing, ENA comprised 3 × 1015 base Archive (ENA; http://www.ebi.ac.uk/ena) has served as a pairs, over 1.5 million taxa, 192 thousand studies, 770 mil- cornerstone of the world’s bioinformatics infrastructure. In lion sequence records and references to almost 200 thou- 2016, the resource continues as a broad and heavily used sand literature publications. platform for the sharing, publication, safeguarding and ENA content is organized into a number of data types reuse of globally comprehensive public nucleotide sequence (see Figure 1). Core data types, such as raw sequence data and associated information. data and derived data, including sequences, assemblies and ENA content spans a spectrum of data types from raw functional annotation, are supplemented with accessory reads to asserted annotation and offers a broad range of data, such as studies and samples, to provide experimen- services to the scientific community. Submissions services tal context. Primary data (such as reads) and several de- cater for our broad range of data providers, from the small- rived data types (sequence, assembly and analysis) are sub- scale submitting research laboratory to the major sequenc- mitted, while remaining derived data types (coding, non- ing centre. Data access services, such as our search and re- coding, marker and environmental) are derived from sub- trieval interfaces support users, from those browsing casu- mitted content as part of processing and indexing within ally to those integrating ENA content through program- ENA. Further information is provided from http://www.ebi. matic access into their own analyses and software applica- ac.uk/ena/submit/data-formats. tions. Our helpdesk provides responsive support to several Data are highly integrated to provide discoverability, thousand active data submitters and many times this num- reusability and cross-data set interoperability. This integra- ber of data consumers. tion is achieved at three levels; between records of the same As a member of the International Nucleotide Sequence class (such as through consistent standards of annotation Database Collaboration (INSDC; http://www.insdc.org/) in Feature Tables associated with sequence records), be- *To whom correspondence should be addressed. Tel: +44 1223 494680; Fax: +44 1223 494468; Email: [email protected] C The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected] Nucleic Acids Research, 2017, Vol. 45, Database issue D33 primary data derived data Coding Read Sequence - protein-coding regions of annotated sequences - ‘second generaon’ - mul-pass, assembled sequence sequencing raw data - can carry funconal annotaon Non-coding Trace Assembly - non-protein-coding regions of annotated - capillary sequencing raw - genome- or transcriptome-level sequences data construct comprising reads and sequences Marker - subset of coding and non-coding providing phylogenec or other markers accessory data Environmental - subset of samples relang to Sample environmental applicaons - informaon relang to sampling, Study including context and processing - informaon relang to unit of science (e.g. publicaon) Analysis - serves to group other classes, - idenficaon data (e.g. OTUs) Taxon such as sequence, read and - reference read alignments - informaon relang to sequenced sample - processed reads (e.g. trimmed amd organism filtered reads) -genomic variaons (VCF) - future data types Figure 1. Organization of principal data types in ENA. tween records of different classes (such as a sequence record SUBMISSION making reference to a sample and a taxon) and between ENA offers a comprehensive range of submission op- ENA records and records in external resources (such as a tions through the Webin system (http://www.ebi.ac.uk/ena/ coding record carrying a cross-reference to a record in the submit). This system provides both an interactive web ap- UniProt KnowledgeBase) (4). Further discoverability and plication that offers, inter alia, spreadsheet upload support reusability are afforded through collaborations with exter- and a powerful RESTful programmatic submission inter- nal expert communities, in which we work on the develop- face. The former is recommended for infrequent submitters ment and implementation in ENA of reporting standards and first-time users (including those setting up program- initiatives such as the Minimal Information about a Se- matic submission systems). The latter is recommended for quencing Experiment (5) and the Global Microbial Iden- those with informatics skills who wish to establish ongo- tifier (http://www.globalmicrobialidentifier.org/). ing regular data flow for a project or submitting centre. Given the aggressive technological advance that we con- (We note that data volume per se is not a useful guide in tinue to observe in sequencing technology, we expect ongo- choosing which tool to use; since both web and program- ing change in the nature of incoming data (such as new read matic interfaces work in conjunction with file upload over formats) and continued broadening of the application base FTP (supported within the web application and through served by sequencing (and hence new derived data struc- external clients), with web and RESTful interfaces pro- tures). As such, we consider our role as a ‘pioneer’ database viding the user with transactional control over the sub- to be important and we strive to be agile and responsive mission.) For depositions of multi-omics data, submitters in responding to new requirements from the scientific com- should first contact our helpdesk with a brief outline of munity. We support raw data from all sequencing platforms the data set in question and we will work with colleagues and have in place a technical system that allows rapid exten- within the European Molecular Biology Laboratory, Eu- sion as new data emerge. For new applications, our ‘analy- ropean Bioinformatics Institute (EMBL-EBI) BioStudies sis’ data type has been put in place to allow us rapidly to (https://www.ebi.ac.uk/biostudies/) and BioSamples (http: respond to new derived data types; indeed, reference read //www.ebi.ac.uk/biosamples)(6) data resources to provide alignments and genomic variation data are examples of ex- instructions on how to proceed. tended data types that have been supported using this ap- proach. D34 Nucleic Acids Research, 2017, Vol. 45, Database issue ACCESS SELECTED DEVELOPMENTS IN 2016 We offer a range of web, programmatic and FTP services Genome assembly data, especially from bacterial isolates, that support user access to ENA data, covering search, continue to grow relative to other data types. In order to browse and download functions. serve these data better, we have supplemented the existing Three search services are supported: A simple text search HTML view of assemblies with an XML view, available box, available in the header of all pages on the ENA website from the browser and through
Recommended publications
  • Gene Discovery and Annotation Using LCM-454 Transcriptome Sequencing Scott J
    Downloaded from genome.cshlp.org on September 23, 2021 - Published by Cold Spring Harbor Laboratory Press Methods Gene discovery and annotation using LCM-454 transcriptome sequencing Scott J. Emrich,1,2,6 W. Brad Barbazuk,3,6 Li Li,4 and Patrick S. Schnable1,4,5,7 1Bioinformatics and Computational Biology Graduate Program, Iowa State University, Ames, Iowa 50010, USA; 2Department of Electrical and Computer Engineering, Iowa State University, Ames, Iowa 50010, USA; 3Donald Danforth Plant Science Center, St. Louis, Missouri 63132, USA; 4Interdepartmental Plant Physiology Graduate Major and Department of Genetics, Development, and Cell Biology, Iowa State University, Ames, Iowa 50010, USA; 5Department of Agronomy and Center for Plant Genomics, Iowa State University, Ames, Iowa 50010, USA 454 DNA sequencing technology achieves significant throughput relative to traditional approaches. More than 261,000 ESTs were generated by 454 Life Sciences from cDNA isolated using laser capture microdissection (LCM) from the developmentally important shoot apical meristem (SAM) of maize (Zea mays L.). This single sequencing run annotated >25,000 maize genomic sequences and also captured ∼400 expressed transcripts for which homologous sequences have not yet been identified in other species. Approximately 70% of the ESTs generated in this study had not been captured during a previous EST project conducted using a cDNA library constructed from hand-dissected apex tissue that is highly enriched for SAMs. In addition, at least 30% of the 454-ESTs do not align to any of the ∼648,000 extant maize ESTs using conservative alignment criteria. These results indicate that the combination of LCM and the deep sequencing possible with 454 technology enriches for SAM transcripts not present in current EST collections.
    [Show full text]
  • Whole Genome Sequencing Data of Multiple Individuals of Pakistani
    www.nature.com/scientificdata oPeN Whole genome sequencing data DAtA DeScriptor of multiple individuals of Pakistani descent Shahid Y. Khan1, Muhammad Ali1, Mei-Chong W. Lee2, Zhiwei Ma3, Pooja Biswas4, Asma A. Khan5, Muhammad Asif Naeem5, Saima Riazuddin 6, Sheikh Riazuddin5,7,8, Radha Ayyagari4, J. Fielding Hejtmancik 3 & S. Amer Riazuddin1 ✉ Here we report whole genome sequencing of four individuals (H3, H4, H5, and H6) from a family of Pakistani descent. Whole genome sequencing yielded 1084.92, 894.73, 1068.62, and 1005.77 million mapped reads corresponding to 162.73, 134.21, 160.29, and 150.86 Gb sequence data and 52.49x, 43.29x, 51.70x, and 48.66x average coverage for H3, H4, H5, and H6, respectively. We identifed 3,529,659, 3,478,495, 3,407,895, and 3,426,862 variants in the genomes of H3, H4, H5, and H6, respectively, including 1,668,024 variants common in the four genomes. Further, we identifed 42,422, 39,824, 28,599, and 35,206 novel variants in the genomes of H3, H4, H5, and H6, respectively. A major fraction of the variants identifed in the four genomes reside within the intergenic regions of the genome. Single nucleotide polymorphism (SNP) genotype based comparative analysis with ethnic populations of 1000 Genomes database linked the ancestry of all four genomes with the South Asian populations, which was further supported by mitochondria based haplogroup analysis. In conclusion, we report whole genome sequencing of four individuals of Pakistani descent. Background & Summary The completion of Human Genome Project ignited several large scale efforts to characterize variations in the human genome, which led to a comprehensive catalog of the common variants including single-nucleotide polymorphisms (SNPs) and insertions/deletions (indels), across the entire human genome1,2.
    [Show full text]
  • EMBL-EBI Powerpoint Presentation
    Processing data from high-throughput sequencing experiments Simon Anders Use-cases for HTS · de-novo sequencing and assembly of small genomes · transcriptome analysis (RNA-Seq, sRNA-Seq, ...) • identifying transcripted regions • expression profiling · Resequencing to find genetic polymorphisms: • SNPs, micro-indels • CNVs · ChIP-Seq, nucleosome positions, etc. · DNA methylation studies (after bisulfite treatment) · environmental sampling (metagenomics) · reading bar codes Use cases for HTS: Bioinformatics challenges Established procedures may not be suitable. New algorithms are required for · assembly · alignment · statistical tests (counting statistics) · visualization · segmentation · ... Where does Bioconductor come in? Several steps: · Processing of the images and determining of the read sequencest • typically done by core facility with software from the manufacturer of the sequencing machine · Aligning the reads to a reference genome (or assembling the reads into a new genome) • Done with community-developed stand-alone tools. · Downstream statistical analyis. • Write your own scripts with the help of Bioconductor infrastructure. Solexa standard workflow SolexaPipeline · "Firecrest": Identifying clusters ⇨ typically 15..20 mio good clusters per lane · "Bustard": Base calling ⇨ sequence for each cluster, with Phred-like scores · "Eland": Aligning to reference Firecrest output Large tab-separated text files with one row per identified cluster, specifying · lane index and tile index · x and y coordinates of cluster on tile · for each
    [Show full text]
  • An Open-Sourced Bioinformatic Pipeline for the Processing of Next-Generation Sequencing Derived Nucleotide Reads
    bioRxiv preprint doi: https://doi.org/10.1101/2020.04.20.050369; this version posted May 28, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. An open-sourced bioinformatic pipeline for the processing of Next-Generation Sequencing derived nucleotide reads: Identification and authentication of ancient metagenomic DNA Thomas C. Collin1, *, Konstantina Drosou2, 3, Jeremiah Daniel O’Riordan4, Tengiz Meshveliani5, Ron Pinhasi6, and Robin N. M. Feeney1 1School of Medicine, University College Dublin, Ireland 2Division of Cell Matrix Biology Regenerative Medicine, University of Manchester, United Kingdom 3Manchester Institute of Biotechnology, School of Earth and Environmental Sciences, University of Manchester, United Kingdom [email protected] 5Institute of Paleobiology and Paleoanthropology, National Museum of Georgia, Tbilisi, Georgia 6Department of Evolutionary Anthropology, University of Vienna, Austria *Corresponding Author Abstract The emerging field of ancient metagenomics adds to these Bioinformatic pipelines optimised for the processing and as- processing complexities with the need for additional steps sessment of metagenomic ancient DNA (aDNA) are needed in the separation and authentication of ancient sequences from modern sequences. Currently, there are few pipelines for studies that do not make use of high yielding DNA cap- available for the analysis of ancient metagenomic DNA ture techniques. These bioinformatic pipelines are tradition- 1 4 ally optimised for broad aDNA purposes, are contingent on (aDNA) ≠ The limited number of bioinformatic pipelines selection biases and are associated with high costs.
    [Show full text]
  • Sequence Alignment/Map Format Specification
    Sequence Alignment/Map Format Specification The SAM/BAM Format Specification Working Group 3 Jun 2021 The master version of this document can be found at https://github.com/samtools/hts-specs. This printing is version 53752fa from that repository, last modified on the date shown above. 1 The SAM Format Specification SAM stands for Sequence Alignment/Map format. It is a TAB-delimited text format consisting of a header section, which is optional, and an alignment section. If present, the header must be prior to the alignments. Header lines start with `@', while alignment lines do not. Each alignment line has 11 mandatory fields for essential alignment information such as mapping position, and variable number of optional fields for flexible or aligner specific information. This specification is for version 1.6 of the SAM and BAM formats. Each SAM and BAMfilemay optionally specify the version being used via the @HD VN tag. For full version history see Appendix B. Unless explicitly specified elsewhere, all fields are encoded using 7-bit US-ASCII 1 in using the POSIX / C locale. Regular expressions listed use the POSIX / IEEE Std 1003.1 extended syntax. 1.1 An example Suppose we have the following alignment with bases in lowercase clipped from the alignment. Read r001/1 and r001/2 constitute a read pair; r003 is a chimeric read; r004 represents a split alignment. Coor 12345678901234 5678901234567890123456789012345 ref AGCATGTTAGATAA**GATAGCTGTGCTAGTAGGCAGTCAGCGCCAT +r001/1 TTAGATAAAGGATA*CTG +r002 aaaAGATAA*GGATA +r003 gcctaAGCTAA +r004 ATAGCT..............TCAGC -r003 ttagctTAGGC -r001/2 CAGCGGCAT The corresponding SAM format is:2 1Charset ANSI X3.4-1968 as defined in RFC1345.
    [Show full text]
  • Genomic Data Standards Resources and Initiatives Cited in the Supplemental Information to the Genomic Data Sharing Policy
    Genomic Data Standards Resources and Initiatives Cited in the Supplemental Information to the Genomic Data Sharing Policy IMPORTANT NOTE: The National Institutes of Health makes no endorsement of the non-NIH-funded genomic data standards resources and/or initiatives included in this document. Categories Resources and Initiatives Common Data Element Resource Portal: http://www.nlm.nih.gov/cde/ NIH encourages the use of common data elements (CDEs) in clinical research, patient registries, and other human subject research in order to improve data quality and opportunities for comparison and combination of data from multiple studies and with electronic health records. This portal provides access to NIH-supported CDE initiatives and other tools and resources that can assist investigators developing protocols for data collection. Clinical Genome Resource (ClinGen): http://www.nih.gov/news/health/sep2013/nhgri-25.htm In 2013, the NIH National Human Genome Research Institute (NHGRI) and the Eunice Kennedy Shriver National Institute NIH-Funded of Child Health and Human Development (NICHD) awarded three grants totalling over $25 million to support a consortium of research groups to design and implement a framework for evaluating variants that are relevant to patient care (e.g., play a role in disease). International Collaboration for Clinical Genomics (ICCG): http://www.iccg.org/about-the-iccg/clingen/ ICCG was awarded as a grant under ClinGen. ICCG was charged with developing standard formats for gathering and depositing data in ClinVar (http://www.ncbi.nlm.nih.gov/clinvar/). It will work with a variety of different stakeholder groups, including clinical laboratories and existing locus-specific databases, to obtain robust data sets on genomic variants and disease associations.
    [Show full text]
  • Comparative Analysis of Pacbio and Oxford Nanopore Sequencing Technologies for Transcriptomic Landscape Identification of Penaeus Monodon
    life Brief Report Comparative Analysis of PacBio and Oxford Nanopore Sequencing Technologies for Transcriptomic Landscape Identification of Penaeus monodon Zulema Udaondo 1 , Kanchana Sittikankaew 2, Tanaporn Uengwetwanit 2 , Thidathip Wongsurawat 1,3 , Chutima Sonthirod 4, Piroon Jenjaroenpun 1,3 , Wirulda Pootakham 4, Nitsara Karoonuthaisiri 2 and Intawat Nookaew 1,* 1 Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR 72205, USA; [email protected] (Z.U.); [email protected] (T.W.); [email protected] (P.J.) 2 National Center for Genetic Engineering and Biotechnology (BIOTEC), National Science and Technology Development Agency, Pathum Thani 12120, Thailand; [email protected] (K.S.); [email protected] (T.U.); [email protected] (N.K.) 3 Division of Bioinformatics and Data Management for Research, Department of Research and Development, Faculty of Medicine, Siriraj Hospital, Mahidol University, Bangkok 10700, Thailand 4 National Omics Center (NOC), National Science and Technology Development Agency (NSTDA), Pathum Thani 12120, Thailand; [email protected] (C.S.); [email protected] (W.P.) * Correspondence: [email protected]; Tel.: +1-501-686-6025; Fax: +1-501-603-1766 Citation: Udaondo, Z.; Sittikankaew, K.; Uengwetwanit, T.; Wongsurawat, Abstract: With the advantages that long-read sequencing platforms such as Pacific Biosciences T.; Sonthirod, C.; Jenjaroenpun, P.; (Menlo Park, CA, USA) (PacBio) and Oxford Nanopore Technologies (Oxford, UK) (ONT) can offer, Pootakham, W.; Karoonuthaisiri, N.; various research fields such as genomics and transcriptomics can exploit their benefits. Selecting an Nookaew, I. Comparative Analysis of appropriate sequencing platform is undoubtedly crucial for the success of the research outcome, PacBio and Oxford Nanopore thus there is a need to compare these long-read sequencing platforms and evaluate them for specific Sequencing Technologies for research questions.
    [Show full text]
  • Multi-Scale Analysis and Clustering of Co-Expression Networks
    Multi-scale analysis and clustering of co-expression networks Nuno R. Nen´e∗1 1Department of Genetics, University of Cambridge, Cambridge, UK September 14, 2018 Abstract 1 Introduction The increasing capacity of high-throughput genomic With the advent of technologies allowing the collec- technologies for generating time-course data has tion of time-series in cell biology, the rich structure stimulated a rich debate on the most appropriate of the paths that cells take in expression space be- methods to highlight crucial aspects of data struc- came amenable to processing. Several methodologies ture. In this work, we address the problem of sparse have been crucial to carefully organize the wealth of co-expression network representation of several time- information generated by experiments [1], including course stress responses in Saccharomyces cerevisiae. network-based approaches which constitute an excel- We quantify the information preserved from the origi- lent and flexible option for systems-level understand- nal datasets under a graph-theoretical framework and ing [1,2,3]. The use of graphs for expression anal- evaluate how cross-stress features can be identified. ysis is inherently an attractive proposition for rea- This is performed both from a node and a network sons related to sparsity, which simplifies the cumber- community organization point of view. Cluster anal- some analysis of large datasets, in addition to being ysis, here viewed as a problem of network partition- mathematically convenient (see examples in Fig.1). ing, is achieved under state-of-the-art algorithms re- The properties of co-expression networks might re- lying on the properties of stochastic processes on the veal a myriad of aspects pertaining to the impact of constructed graphs.
    [Show full text]
  • Globalfungi, a Global Database of Fungal Occurrences from High
    www.nature.com/scientificdata OPEN GlobalFungi, a global database DATA DEscrIPTor of fungal occurrences from high-throughput-sequencing metabarcoding studies Tomáš Větrovský1,6, Daniel Morais1,6, Petr Kohout1,6, Clémentine Lepinay1,6, Camelia Algora1, Sandra Awokunle Hollá1, Barbara Doreen Bahnmann1, Květa Bílohnědá1, Vendula Brabcová1, Federica D’Alò2, Zander Rainier Human1, Mayuko Jomura 3, Miroslav Kolařík1, Jana Kvasničková1, Salvador Lladó1, Rubén López-Mondéjar1, Tijana Martinović1, Tereza Mašínová1, Lenka Meszárošová1, Lenka Michalčíková1, Tereza Michalová1, Sunil Mundra4,5, Diana Navrátilová1, Iñaki Odriozola 1, Sarah Piché-Choquette 1, Martina Štursová1, Karel Švec1, Vojtěch Tláskal 1, Michaela Urbanová1, Lukáš Vlk1, Jana Voříšková1, Lucia Žifčáková1 & Petr Baldrian 1 ✉ Fungi are key players in vital ecosystem services, spanning carbon cycling, decomposition, symbiotic associations with cultivated and wild plants and pathogenicity. The high importance of fungi in ecosystem processes contrasts with the incompleteness of our understanding of the patterns of fungal biogeography and the environmental factors that drive those patterns. To reduce this gap of knowledge, we collected and validated data published on the composition of soil fungal communities in terrestrial environments including soil and plant-associated habitats and made them publicly accessible through a user interface at https://globalfungi.com. The GlobalFungi database contains over 600 million observations of fungal sequences across > 17 000 samples with geographical locations and additional metadata contained in 178 original studies with millions of unique nucleotide sequences (sequence variants) of the fungal internal transcribed spacers (ITS) 1 and 2 representing fungal species and genera. The study represents the most comprehensive atlas of global fungal distribution, and it is framed in such a way that third-party data addition is possible.
    [Show full text]
  • Closure of the NCBI SRA and Implications for the Long-Term Future of Genomics Data Storage GB Editorial Team*
    GB Editorial Team Genome Biology 2011, 12:402 http://genomebiology.com/2011/12/3/402 EDITORIAL Closure of the NCBI SRA and implications for the long-term future of genomics data storage GB Editorial Team* e National Center for Biotechnology Information accept submissions doesn’t meant that the SRA is closing, (NCBI) in the US recently announced that, as a result of merely changing and the European Nucleotide Archive budgetary constraints, it would no longer be accepting (ENA) at EMBL-EBI will remain. e NCBI’s decision submissions to its Sequence Read Archive (SRA) and that was based on budgetary constraints. It should be noted over the course of the next year or so it would slowly that most people don’t realize that storage space is only a phase out support for this database (http://www.ncbi. minor fraction of the budget of the database; the bulk of nlm.nih.gov/sra). ere seems to be a certain amount of the cost is associated with the staff who maintain the confusion in the community about what effect this database, process the submissions, develop the software decision will have. At Genome Biology we feel that the and so on. free availability of data is an important concept for science, SS: From the outside, it appears that the SRA is closing so we asked the views of various interested people on what because of NIH budgetary considerations. One problem the short-term implications of this announcement will is that the amount of sequence being generated is be, and also how they envisaged the future of data storage growing at an extraordinary rate, probably faster than in the long term.
    [Show full text]
  • NGS Raw Data Analysis
    NGS raw data analysis Introduc3on to Next-Generaon Sequencing (NGS)-data Analysis Laura Riva & Maa Pelizzola Outline 1. Descrip7on of sequencing plaorms 2. Alignments, QC, file formats, data processing, and tools of general use 3. Prac7cal lesson Illumina Sequencing Single base extension with incorporaon of fluorescently labeled nucleodes Library preparaon Automated cluster genera3on DNA fragmenta3on and adapter ligaon Aachment to the flow cell Cluster generaon by bridge amplifica3on of DNA fragments Sequencing Illumina Output Quality Scores Sequence Files Comparison of some exisng plaGorms 454 Ti RocheT Illumina HiSeqTM ABI 5500 (SOLiD) 2000 Amplificaon Emulsion PCR Bridge PCR Emulsion PCR Sequencing Pyrosequencing Reversible Ligaon-based reac7on terminators sequencing Paired ends/sep Yes/3kb Yes/200 bp Yes/3 kb Read length 400 bp 100 bp 75 bp Advantages Short run 7mes. The most popular Good base call Longer reads plaorm accuracy. Good improve mapping in mulplexing repe7ve regions. capability Ability to detect large structural variaons Disadvantages High reagent cost. Higher error rates in repeat sequences The Illumina HiSeq! Stacey Gabriel Libraries, lanes, and flowcells Flowcell Lanes Illumina Each reaction produces a unique Each NGS machine processes a library of DNA fragments for single flowcell containing several sequencing. independent lanes during a single sequencing run Applications of Next-Generation Sequencing Basic data analysis concepts: • Raw data analysis= image processing and base calling • Understanding Fastq • Understanding Quality scores • Quality control and read cleaning • Alignment to the reference genome • Understanding SAM/BAM formats • Samtools • Bedtools Understanding FASTQ file format FASTQ format is a text-based format for storing a biological sequence and its corresponding quality scores.
    [Show full text]
  • L3: Short Read Alignment to a Reference Genome
    L3: Short Read Alignment to a Reference Genome Shamith Samarajiwa CRUK Autumn School in Bioinformatics Cambridge, September 2017 Where to get help! http://seqanswers.com http://www.biostars.org http://www.bioconductor.org/help/mailing-list Read the posting guide before sending email! Overview ● Understand the difference between reference genome builds ● Introduction to Illumina sequencing ● Short read aligners ○ BWA ○ Bowtie ○ STAR ○ Other aligners ● Coverage and Depth ● Mappability ● Use of decoy and sponge databases ● Alignment Quality, SAMStat, Qualimap ● Samtools and Picard tools, ● Visualization of alignment data ● A very brief look at long reads, graph genome aligners and de novo genome assembly Reference Genomes ● A haploid representation of a species genome. ● The human genome is a haploid mosaic derived from 13 volunteer donors from Buffalo, NY. USA. ● In regions where there is known large scale population variation, sets of alternate loci (178 in GRCh38) are assembled alongside the reference locus. ● The current build has around 500 gaps, whereas the first version had ~150,000 gaps. Genome Reference Consortium: https://www.ncbi.nlm.nih.gov/grc £600 30X Sequencers http://omicsmap.com Illumina sequencers Illumina Genome Analyzer Illumina sequencing technology ● Illumina sequencing is based on the Solexa technology developed by Shankar Balasubramanian and David Klenerman (1998) at the University of Cambridge. ● Multiple steps in “Sequencing by synthesis” (explained in next slide) ○ Library Preparation ○ Bridge amplification and Cluster generation ○ Sequencing using reversible terminators ○ Image acquisition and Fastq generation ○ Alignment and data analysis Illumina Flowcell Sequencing By Synthesis technology Illumina Sequencing Bridge Amplification Sequencing Cluster growth Incorporation of fluorescence, reversibly terminated tagged nt Multiplexing • Multiplexing gives the ability to sequence multiple samples at the same time.
    [Show full text]