A Bioinformatician's Guide to Metagenomics
MICROBIOLOGY AND MOLECULAR BIOLOGY REVIEWS, Dec. 2008, p. 557–578, Vol. 72, No. 4
1092-2172/08/$08.00+0 doi:10.1128/MMBR.00009-08
Copyright © 2008, American Society for Microbiology. All Rights Reserved.

Victor Kunin,1 Alex Copeland,2 Alla Lapidus,3 Konstantinos Mavromatis,4 and Philip Hugenholtz1*

Microbial Ecology Program,1 Quality Assurance Department,2 Microbial Genomics Department,3 and Genome Biology Program,4 DOE Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, California

* Corresponding author. Mailing address: Microbial Ecology Program, DOE Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, CA. Phone: (925) 296-5725. Fax: (925) 296-5720. E-mail: [email protected].

INTRODUCTION
PRESEQUENCING CONSIDERATIONS
    Community Composition
    Selection of Sequencing Technology
    How Much Sequence Data?
SAMPLING AND DATA GENERATION
    Sample Collection for Metagenomes and Other Molecular Analyses
    Sample Metadata Collection
    Premetagenome Community Composition Profiling
    Shotgun Library Preparation
    Sequencing
SEQUENCE PROCESSING
    Sequence Read Preprocessing
    Assembly
    Finishing
    Gene Prediction and Annotation
DATA ANALYSIS
    Postsequencing Community Composition Estimates
    Binning
    Analyzing Dominant Populations
    Gene-Centric Analysis
DATA MANAGEMENT
CONCLUDING REMARKS
ACKNOWLEDGMENTS
REFERENCES

INTRODUCTION

For the purposes of this review, we define metagenomics as the application of shotgun sequencing to DNA obtained directly from an environmental sample or series of related samples, producing at least 50 Mbp of randomly sampled sequence data. This distinguishes it from functional metagenomics, as reviewed elsewhere previously (58), whereby environmental DNA is cloned and screened for specific functional activities of interest. Metagenomics is a derivation of conventional microbial genomics, with the key difference being that it bypasses the requirement for obtaining pure cultures for sequencing. Therefore, metagenomics holds the promise of revealing the genomes of the majority of microorganisms that cannot be readily obtained in pure culture (62). In addition, since the samples are obtained from communities rather than isolated populations, metagenomics may serve to establish hypotheses concerning interactions between community members.

Indeed, metagenomics is increasingly being viewed as a baseline technology for understanding the ecology and evolution of microbial ecosystems, upon which hypotheses and experimental strategies are built (145), and with new sequencing technologies producing hundreds of megabases of data for well under $20,000 (see “Sequencing”), metagenomics is within the reach of many laboratories.

In this review, we address the bioinformatic aspects of analyzing metagenomic data sets, stressing the differences with standard genomic analyses. Although our focus is on bioinformatics, we will begin by considering experimental planning and implementation of metagenomic projects, as these aspects can have major impacts on subsequent bioinformatic analyses.

Throughout the review, we will follow the workflow of a typical metagenomic project at the Joint Genome Institute (JGI) (summarized in Fig. 1). This process begins with sample and metadata collection and proceeds with DNA extraction, library construction, sequencing, read preprocessing, and assembly. Genes are then called on either reads, contigs, or both, and binning is applied. Community composition analysis is employed at several stages of this workflow, and databases are used to facilitate the analysis. All of these stages will be discussed in detail below. We expect that some details of the workflow will be different in other sequencing facilities, and some aspects may be difficult to reproduce in a small research laboratory embarking alone on a metagenomic project without the support of a dedicated facility. Moreover, the rapid advancement of sequencing technologies will change the suite of tools available for metagenomic analysis. Therefore, rather than focusing on available tools, we emphasize the consider-

analysis.
For example, one of the main reasons that the hindgut of a higher rather than lower termite was sequenced (146) is because the former lacks protist symbionts. When sequencing microbial communities that are found in tight symbiotic relationships with eukaryotic hosts, the removal of host cells or extracted host DNA is important to avoid eukaryotic contamination. For example, in the analysis of a gutless worm microbial symbiont community, host cells were physically separated from bacterial endosymbiont populations using a Nycodenz gradient (150).

[FIG. 1. Workflow diagram of a typical metagenomic project. Recoverable labels: sample collection; storage for other molecular analyses (e.g., proteomics, transcriptomics, viruses); cellular DNA; metagenomic pipeline; library construction; sequencing; basecalling; vector trimming; reads; assembly; contigs; gene prediction; genes; binning; bins; database; metadata; community composition analysis; 16S sequencing; 16S sequences; 16S microarray; FISH; phylogenetic marker genes.]

Simply excluding eukaryotes from a metagenomic analysis is not ideal from an ecological perspective, as it compromises our ability to assess a microbial community in its entirety. An alternative or complementary strategy could be to obtain molecular data at the RNA (metatranscriptomics) or protein (metaproteomics) level, thus bypassing the problem of large amounts of noncoding eukaryotic sequence data. Emerging sequencing technologies such as pyrosequencing (89) may ultimately allow metagenomic sequencing of communities comprising eukaryotes, but the data are likely to present numerous challenges for many downstream bioinformatic analyses (see “Selection of Sequencing Technology”).

Within the sequence-tractable bacterial, archaeal, and viral components of a community, community complexity should be assessed prior to shotgun sequencing (see “Premetagenome Community Composition Profiling” for a description of assessment methods). Community complexity