The ISME Journal (2011) 5, 777–779
© 2011 International Society for Microbial Ecology All rights reserved 1751-7362/11
www.nature.com/ismej

COMMENTARY

The Future of microbial metagenomics (or is ignorance bliss?)

Jack A Gilbert, Folker Meyer and Mark J Bailey

The ISME Journal (2011) 5, 777–779; doi:10.1038/ismej.2010.178; published online 25 November 2010

A recent explosion in the number of studies taking advantage of the power of next-generation sequencing to explore metagenomic or 16S rRNA taxonomic diversity of microbial environments means that we need to stop and think about how we best interpret these data.

Currently, 16S rRNA gene studies provide us with the most effective way of fingerprinting the species richness of a community, with weak links to culture-derived functional relationships. However, the vast increase in the number of bacterial taxa known only from 16S rRNA sequence has started to sever this link, which can only be corrected through more experimental functional characterization requiring improved culturing techniques. A diversity study that uses core-genome processes and key metabolic functions would be an improvement, especially in the absence of the sequence of every genome for every cell in a system. But even then, we know only the potential of the system, not how it is regulated or what is expressed under any given circumstance, for which metatranscriptomics is required (Gilbert et al., 2008). When appropriately applied, core-genome fingerprints could provide a genuine understanding of the population structure with defined niches, insight to functional variation and how these vary between ecosystems. Currently, our understanding is still very limited, but we do have some ideas about how to proceed.

Better annotation

All interpretation of sequence data currently relies on the analysis of sequence similarity, assuming that similar (or near identical) DNA sequences imply similar (or identical) protein function. Numerous studies have shown that the general paradigm is valid; however, our knowledge of the protein universe is less than perfect. Not only are the current annotations of protein-coding genes not comprehensive, but also a large proportion of genes in newly sequenced microbes and viruses cannot be annotated (or even identified; Roberts, 2004), which severely limits our ability to use metagenomics for microbial diversity studies. Although concerted efforts are under way to improve the coverage of known genome-derived proteins from wet-lab derived biochemical annotations, the technology to link the small body of experimentally generated evidence to large 'families' of similar proteins is still in flux. One major factor that hinders the use of the existing annotations is the bias in the genome and protein knowledge bases; for example, the vast majority of sequenced organisms originate from the medical community. The Genomic Encyclopedia of Bacteria and Archaea (GEBA) project has provided a substantial increase to our understanding of microbial from the rest of the phylogenetic matrix, highlighting the importance of whole genomes in exploring evolution and the protein universe. Importantly, this one study significantly increased the number and diversity of novel proteins, expanding our ability to annotate environmental metagenomic data by as much as 4% (Ivanova et al., 2010).

One major concern is the use of different annotation pipelines by each sequencing center, which potentially produce different results. Simple processes like comparative genomics routinely require re-analysis of all data involved in the comparison (Dinsdale et al., 2008). Attempts to create simple exchange vocabularies have not proven useful for microbial genome analysis. This highlights two issues:

(1) With future data volumes (for example, >300 billion base pairs per run on a HiSeq2000 Illumina platform), re-analysis will not be feasible because the data analysis cost will dominate the sequencing cost (Wilkening et al., 2009).
(2) Databases used for metagenomic analysis need to be well curated and expanded; the community requires sustained investment into annotation infrastructure.

International coordination of effort and access to sequencers/super computers

The genomic project registry (http://www.genomesonline.org) created by Nikos Kyrpides and colleagues allows tracking of (meta)genome sequencing projects, avoiding costly repetition of identical experiments. A similar registry will be required for ecologically driven sequencing projects, helping to avoid duplication of ecosystems, assisting with project design and allowing for the acquisition of comparable data sets. To provide such a project registry, researchers will need a language to express their projects in a computer searchable way. Through the work of the Genomic Standards Consortium (GSC; http://www.gensc.org), the community is now developing controlled vocabularies that allow accurate (and machine readable) descriptions of ecological sequencing projects, enabling questions like: 'Show me all studies of Mediterranean marine sediments in less than 100 meters of water.' This reduces months of paper-searches to seconds of data acquisition.

Coordination of data storage and access

Traditionally, DNA sequence data are archived at NCBI's GenBank (Benson et al., 2009). More recently, environmental (metagenomic) sequences have been deposited in the Short Read Archive (SRA) (http://www.ncbi.nlm.nih.gov/sra). However, SRA deposition and querying is not simple. In addition, it is unclear whether NCBI will continue to function as an archive for all DNA reads generated by a democratized sequencing community. Looking at this from the perspective of a microbial ecology data generator and/or data consumer, it seems clear that the community needs a comprehensive sequence archive for all 16S rRNA gene and metagenomic sequence reads. This will provide an important resource for the microbial ecology community, no matter how inexpensive sequencing becomes. The effort involved in sample extraction and description alone will make long-term storage and provisioning of the sequencing data worthwhile even as technologies change. The exact specifications needed are already described by current de-facto repositories (for example, VAMPS (http://vamps.mbl.edu/), MG-RAST (Meyer et al., 2008) and CAMERA (http://camera.calit2.net/)). While in the past the community lacked the technology to describe metadata (experimental setup, sampling strategy, and so on), through the work of the GSC we can now define the required metadata, enabling data creators to mark up data, and software systems to ingest and provide ways to query and visualize the data.

In this brave new world, one can imagine many portals integrating data relevant for their specific missions, thus creating de-facto archives by downloading from the data producers directly. However, if long-term storage (beyond funding cycles) is required, resources will need to be dedicated to preserve data sets over long periods of time, which must be through the existing network of the INSDC (http://www.insdc.org).

Designing the next generation of experiments

Of course it is the fundamental question of microbial ecology that will focus future research, and the interplay of different technologies will be paramount in answering these questions. For example, high-throughput 16S rRNA gene studies alone can significantly increase our concept of the diversity of life. Now that Rob Knight has shown that short regions of the 16S rRNA gene can provide us with as good a picture of microbial diversity as full length reads (Liu et al., 2007), the massive throughput of Illumina can be leveraged to run thousands of parallel 16S rRNA gene projects in a single instrument run. Understanding how we apply these techniques to each ecosystem is as important as how we cope with the computational analysis; for example, how do we effectively determine the relevant sample size to accurately determine how ecosystem community structure changes over time or space? For future studies, as sequencing and bioinformatics become less of a bottleneck, it will become important that we examine sampling infrastructure, requiring that communities come together to produce standards associated with sampling volume, technology and application. Understanding the role of spatial scale and sampling volume in capturing microbial interaction and community structure is vital to these studies.

Concluding remarks

The ultimate future goal of our community is to provide a far more detailed understanding of microbial ecology to enable parameterization of ecosystem models, which are predictive and descriptive for diversity and metabolism. To do this, we must improve knowledge transfer and the intelligent interpretation of data at a global scale. Improved exchange of ideas and data will inevitably improve and advance the theory, and perhaps even help to define the basic rules for biological systems beyond the constant of nucleic acid. But to achieve this, ecological practices need to be improved and shared so that metadata and genomic information are based on sound experimentation that is built on statistically relevant design. To do this, we need to provide the support and infrastructure to ensure that samples and information are properly curated and readily accessible.

JA Gilbert is at Argonne National Laboratory, Argonne, IL, USA
JA Gilbert is also at Department of Ecology and Evolution, University of Chicago, Chicago, IL, USA;
F Meyer is at Argonne National Laboratory, Argonne, IL, USA
F Meyer is also at Computation Institute, University of Chicago, Chicago, IL, USA and
MJ Bailey is at NERC Centre for Ecology & Hydrology, Crowmarsh Gifford, Wallingford Oxford, UK
E-mail: [email protected]

References

Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. (2009). GenBank. Nucleic Acids Res 37(Database issue): D26–D31.

Dinsdale EA, Edwards RA, Hall D, Angly F, Breitbart M, Brulc JM et al. (2008). Functional metagenomic profiling of nine biomes. Nature 452: 629–632.

Gilbert JA, Field D, Huang Y, Edwards R, Li W, Gilna P et al. (2008). Detection of Large Numbers of Novel Sequences in the Metatranscriptomes of Complex Marine Microbial Communities. PLoS ONE 3: e3042. journal.pone.0003042.

Ivanova N, Tringe SG, Liolios K, Liu W-T, Morrison N, Hugenholtz P et al. (2010). A call for standardized classification of metagenome projects. Environmental Microbiology 12: 1803–1805.

Liu Z, Lozupone C, Hamady M, Bushman FD, Knight R. (2007). Short pyrosequencing reads suffice for accurate microbial community analysis. Nucleic Acids Res 35: e120.

Meyer F, Paarmann D, D'Souza M, Olson R, Glass EM, Kubal M et al. (2008). The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics 9: 386.

Roberts R. (2004). Identifying Protein Function: A Call for Community Action. PLoS Biol 2: e42.

Wilkening J, Desai N, Meyer F, Wilke A. (2009). Using clouds for metagenomics: a case study. IEEE Cluster 2009; New Orleans.
