Infrastructure for Metagenome Data Management and Analysis

Tatiana Tatusova
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 9600 Rockville Pike, Bethesda, MD, 20892, U.S.A.

Keywords: Database, Sequence analysis, Metagenomics.

Abstract: Metagenome sequencing projects are generating unprecedented amounts of data. Public sequence archive databases face large-scale data management challenges, including data storage and the quick search and retrieval of sequence data for further analysis. The sequence data are linked to a rich set of metadata attributes, such as geochemical and ecological parameters for environmental projects and clinical patient information for human microbiome studies. This complex collection of heterogeneous information has to be integrated, organized and presented to users in a meaningful and useful way. For the last 20 years the National Center for Biotechnology Information (NCBI) has been developing infrastructure that supports the storage and distribution of various types of biomolecular data, as well as data integration and easy navigation in a complex information space. Here we describe the NCBI resources that are used for metagenomics data management.

Tatusova T. Infrastructure for Metagenome Data Management and Analysis. DOI: 10.5220/0003333803570362. In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms (Meta-2011), pages 357-362. ISBN: 978-989-8425-36-2. Copyright © 2011 SCITEPRESS (Science and Technology Publications, Lda.)

1 INTRODUCTION

New-generation sequencing technology has made it possible to study microbial communities in their natural environment. By collecting samples directly from the environment and sequencing DNA without isolating and growing organisms under artificial conditions, researchers are given an opportunity to understand the role of microbial organisms in ecological systems. The questions researchers usually ask are:

1) What is the structure of the microbial community and the relative abundance of different species?
2) What is the functional role of the bacterial communities in the ecosystem?

In other words, scientists want to know "who they are" and "what they do". One well-established way to answer the first question is to collect and sequence 16S rRNA genes and perform phylogenetic analysis. To answer the second question, genomic DNA is sequenced, assembled and annotated. Analysis of the predicted proteins might provide some insight into the functional role of bacterial communities in the regulation of biochemical processes in ecosystems. More recently, with RNA-Seq technology, metatranscriptome data have become available for analyzing the expression level of functional activity in microbial communities.

Sequence data generated by metagenome projects are made available to the research community through the public data archives described in Section 2. Sequence data by themselves do not contain enough information for analysis; it is necessary to frame the physicochemical context within which the data are to be interpreted. The initial step in any metagenomics study is the collection of samples destined for analysis. The geochemical or medical characteristics describing a sample constitute the "metadata", which are intricately tied to a given sample and aid in interpreting the biological significance of the genetic information. Linking the metadata to the sequence data extracted from the sample is one of the key elements in further analysis. Computational analysis of metagenomes is a great challenge because the huge volume of metagenomics data requires extremely powerful computational resources and novel approaches to sequence analysis and visualization.

NCBI has recently developed new resources that allow capturing some of the metadata associated with a sequence submission, such as the description of the project, the description of each sample, and the geochemical and ecological data associated with the study. Section 2.3 describes these resources in detail.

In addition to general archive databases, NCBI has created specialized resources and tools that can be used in metagenomics data analysis. These resources are discussed in Section 3.

2 PRIMARY DATA ARCHIVES

As a national resource for molecular biology information, the National Center for Biotechnology Information develops, distributes, supports, and coordinates access to a variety of databases and software for the scientific and medical communities (Sayers, 2010).

2.1 Sequence Read Archive

The advent of massively parallel sequencing technologies has opened an extensive new vista of research possibilities, including elucidation of the human microbiome, discovery of polymorphisms and mutations in individual genomes, mapping of protein–DNA interactions, and positioning of nucleosomes, to name just a few. In order to achieve these research goals, researchers must be able to effectively store, access, and manipulate the enormous volume of read data generated from massively parallel sequencing experiments.

In response to the research community's call for such a resource, NCBI, EBI, and DDBJ, under the auspices of the International Nucleotide Sequence Database Collaboration (INSDC), have developed the Sequence Read Archive (SRA) data storage and retrieval system (Shumway, 2010). The SRA not only provides a place where researchers can archive their sequence read data, but also enables them to quickly access known data and their associated experimental descriptions (metadata).

Now that the archive has reached an initial state of completion and is publicly available at NCBI, it is being deployed at EBI (under the name European Read Archive, or ERA), and will soon also be deployed at DDBJ (under the name DDBJ Read Archive, or DRA). NCBI and EBI have already begun exchanging data, and once the DRA is in place at DDBJ, there will be a regular data exchange between all three INSDC members.

In order to store and retrieve the enormous amount of data generated by massively parallel sequencing technologies, NCBI, EBI and DDBJ needed to create a data repository that has much of the power of a relational database while being lightweight, transportable and flexible like flat-file storage. The solution was to create a hybrid relational database with a file-based and column-oriented design.

Within SRA the data are organized into four types of records: studies (SRP accessions), experiments (SRX accessions), samples (SRS accessions) and runs (SRR accessions). Studies contain one or more experiments, each of which contains one or more runs, each of which in turn may contain data on tens of millions of individual reads. The various record types representing data from a study are all linked to one another within Entrez (www.ncbi.nlm.nih.gov/sra/), allowing users to browse the data easily on the web.

2.2 GenBank – Nucleotide Sequence Archive Database

GenBank (Benson et al., 2010) is a comprehensive database that contains publicly available nucleotide sequences for more than 300 000 organisms named at the genus level or lower, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects, including whole genome shotgun (WGS) and environmental sampling projects. NCBI builds GenBank primarily from the submission of sequence data from authors and from the bulk submission of expressed sequence tag (EST), genome survey sequence (GSS) and other high-throughput data from sequencing centers. GenBank data are available at no cost over the Internet, through FTP and a wide range of web-based retrieval and analysis services.

2.3 Metadata: BioProject and BioSample

BioProject. New technologies have significantly increased the volume of data that can be generated and submitted to archival database resources. A genome project is no longer limited to genome sequencing, assembly, and annotation. New types of experimental studies include epigenomics, proteomics, metabolomics and more 'omics'. Advances in sequencing technologies have also changed the scope of genomic studies; it has become possible to sequence multiple genomes of many different organisms, from hundreds of bacterial strains to 1000 human individuals. It is also possible to sequence microbial populations in their natural environment, without growing them in culture, by sequencing samples collected from the environment. Our view of genomic, metagenomic and biomedical projects is rapidly changing, and this affects the way the data are organized and represented in NCBI databases. The BioProject database provides a mechanism to access datasets that are otherwise difficult to find. The definition of a set of related data, a 'project', is flexible and supports the need to define a complex project and various distinct sub-projects using different parameters.

BioSample. The time, place and collection method can profoundly affect the microbial composition in a sample. Geographical location, biochemical

2.4 Developing Community Standards for Metagenome Data

The astonishing increase in the amount of data generated by metagenomics projects that involve shotgun sequencing of all the organisms in an environmental sample creates an unanticipated situation in the field. Data storage and retrieval is becoming a
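The SRA record hierarchy described in Section 2.1 (studies containing experiments, experiments containing runs, with samples linked in) can be illustrated with a short sketch. This is not NCBI's implementation: the class names, the `sample_accession` link on an experiment, the `read_count` field, the digits-only accession pattern and the example accessions are all assumptions made for illustration. Only the four prefixes (SRP, SRX, SRS, SRR) come from the text.

```python
import re
from dataclasses import dataclass, field

# Prefix-to-record-type mapping as listed in Section 2.1.
RECORD_TYPES = {"SRP": "study", "SRX": "experiment", "SRS": "sample", "SRR": "run"}

# Assumption: an accession is one of the four prefixes followed by digits only.
_ACCESSION = re.compile(r"^(SRP|SRX|SRS|SRR)(\d+)$")


def record_type(accession: str) -> str:
    """Return the record type implied by an accession prefix."""
    match = _ACCESSION.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a recognised SRA accession: {accession!r}")
    return RECORD_TYPES[match.group(1)]


@dataclass
class Run:
    accession: str       # SRR...
    read_count: int = 0  # hypothetical field: number of reads in the run


@dataclass
class Experiment:
    accession: str         # SRX...
    sample_accession: str  # SRS... (assumed link from experiment to sample)
    runs: list = field(default_factory=list)


@dataclass
class Study:
    accession: str  # SRP...
    experiments: list = field(default_factory=list)

    def total_reads(self) -> int:
        """Sum read counts over every run of every experiment."""
        return sum(run.read_count for exp in self.experiments for run in exp.runs)

    def samples(self) -> set:
        """Collect the distinct sample accessions referenced by the study."""
        return {exp.sample_accession for exp in self.experiments}


# Made-up placeholder accessions, not real SRA records.
study = Study("SRP000001", [
    Experiment("SRX000001", "SRS000001",
               [Run("SRR000001", 1_000), Run("SRR000002", 2_500)]),
    Experiment("SRX000002", "SRS000001", [Run("SRR000003", 500)]),
])
```

Nesting runs inside experiments and experiments inside studies mirrors the containment stated in the text, while samples are referenced by accession rather than nested, so that one sample can be shared by several experiments.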
