Vol 455|25 September 2008 Q&A

MICROBIOLOGY
Metagenomics
Philip Hugenholtz and Gene W. Tyson

Ten years after the term metagenomics was coined, the approach continues to gather momentum. This culture-independent, molecular way of analysing environmental samples of cohabiting microbial populations has opened up fresh perspectives on microbiology.

Why the ‘meta’ in metagenomics?
Genomics determines the complete genetic complement of an organism by high-throughput sequencing of the base pairs of its DNA. The most prominent example was the Human Genome Project, which involved the sequencing of 3 billion base pairs. But the genomes of hundreds of organisms from all three domains of life (bacteria, archaea and eukarya), as well as those of quasi-life forms such as viruses, have now been sequenced. Metagenomics, by contrast, involves sampling the genome sequences of a community of organisms inhabiting a common environment. Metagenomics has also been more broadly defined as any type of analysis of DNA obtained directly from the environment: for example, screening such DNA, after the appropriate procedures, for particular enzymatic activity. To date, the approach has been applied exclusively to microbial communities.

Why do we need metagenomics?
Microbiology has traditionally been based on pure cultures grown in the laboratory. But most microorganisms cannot be grown in this way and we have been ignorant of their existence. This cultivation bottleneck has skewed our view of microbial diversity and limited our appreciation of the microbial world. Metagenomics provides a relatively unbiased view not only of the community structure (species richness and distribution) but also of the functional (metabolic) potential of a community.

What environments can be analysed?
In principle, any environment is amenable to metagenomic analysis provided that nucleic acids can be extracted from sample material (Fig. 1). Simpler communities are more tractable to a particular technique called shotgun sequencing (Box 1); this was a rationale for one of the earliest studies, which targeted a biofilm in acid drainage from mines that consisted of only a handful of dominant microbial populations. Most interest, however, has centred on the marine environment: the largest metagenomic study to date is the Global Ocean Sampling Expedition, which follows the voyage of Darwin’s ship HMS Beagle. Metagenomics is now also being adopted in medicine. Of particular note is an international initiative, the Human Microbiome Project, which aims to map human-associated microbial communities (including those of the gut, mouth, skin and vagina).

What surprises have there been?
A strength of metagenomics is its potential for serendipitous discovery. An example is the discovery of proteorhodopsin proteins, light-driven proton pumps that were first identified in environmental DNA from bacterioplankton. Proteorhodopsins have since been found to be widely distributed and highly expressed in diverse microbial groups from aquatic environments, and they may represent a major source of energy flux in the photic zone of the world’s oceans. A more recent discovery is that of archaeal ammonia oxidizers. It was thought that bacteria were solely responsible for aerobic ammonia oxidation, although their numbers often could not account for the observed rates of ammonia oxidation in many habitats. The fortuitous discovery of

[Figure 1 graphic: timeline running from January 2003 to August 2008. Labels recoverable from the graphic: acid mine drainage biofilm; Sargasso Sea; Bras del Port saltern; Mediterranean Sea; human gut microbiome (US); human gut microbiome (Japan); human gut viriome; human faeces microbiome; mouse gut microbiome; termite gut microbiome; gutless worm microbiome; phosphorus-removing bioreactors; nine biomes; whale fall; Minnesota farm soil; soils; Soudan Mine; Eel River sediments (anaerobic methane oxidizers); Guerrero Negro hypersaline mat; marine viral community; marine RNA viriome; oceanic viriomes; drinking water; Neanderthal; Pleistocene cave bear fossils; mammoth fossil; coral holobiont; coral reef viral community; Hawaii Ocean Time Series; Global Ocean Sampling.]

Figure 1 | Timeline of sequence-based metagenomic projects showing the variety of environments sampled since 2002. The oceanic viriomes (all viruses in a habitat) (August 2006) were from the Sargasso Sea, Gulf of Mexico, coastal British Columbia and the Arctic Ocean. The nine biomes (March 2008) were stromatolites, fish gut, fish ponds, mosquito viriome, human-lung viriome, chicken gut, bovine gut and marine viriome. The different technologies used are dye-terminator shotgun sequencing (black), fosmid library sequencing (pink) and pyrosequencing (green). (Graphic based on data sets represented at www.genomesonline.org.)

NEWS & VIEWS Q&A NATURE|Vol 455|25 September 2008

an ammonia monooxygenase gene next to an archaeal marker gene (encoding small-subunit ribosomal RNA) spawned a rash of papers implicating archaea as the main source of ammonia oxidation in many marine and terrestrial ecosystems.

Can whole genomes be reconstructed from an environmental sample?
Yes: the genomes of dominant species can be fully reconstructed from environmental samples using random sequencing. For example, complete or near-complete genomes have been assembled from microbial populations present in biofilms in acid mine drainage, in activated sludges and in marine samples. Having the complete or near-complete genome of a dominant population provides the gene inventory for the organism and allows its metabolic potential to be determined, including inferring the absence of metabolic pathways as well as their presence. A key feature of genomes obtained from environmental sources is that they are composites of the population from which they were derived, and encompass the genetic microheterogeneity present in that population.

What have we learned about microbial evolution?
Metagenomics provides the first broad insights into coexisting (sympatric) populations, as every sequence read is derived from a different individual within a given community. In communities in which deep sequence-read coverage of individual populations is possible, metagenomics provides an exquisite view of the evolutionary processes shaping these organisms. For instance, data from archaeal populations in acid mine drainage were used to show that genetic recombination occurs at a much higher frequency than previously predicted, and is the primary evolutionary force shaping these populations. And data from the Pacific and Atlantic oceans revealed that the greatest variation within populations of Prochlorococcus (the most abundant photosynthetic organisms in the ocean) occurs in genomic ‘islands’. These islands are discrete regions in the genome that are believed to be hotspots for genomic innovation and derived in part from genes laterally transferred by viruses. Potentially the most important finding relates to the controversial topic of defining microbial species. Several metagenomic studies have shown clear discontinuity of microbial diversity, suggesting that some microbial species at least are defined by discrete ‘sequence space’.

Are some environments too complex for metagenomics?
Certainly it seems unfeasible (at least for now) to obtain complete genomes of organisms from habitats with complex microbial communities because of the sheer amount of sequencing required. For example, it was estimated that a minimum of 6 billion base pairs would be required to obtain the genome sequence of the most dominant population in a soil sample, and many times that to obtain genomes from less dominant populations. But it is possible to extract biologically meaningful information from completely unassembled sequence data using a gene-centric approach.

… and the gene-centric approach is?
Rather than considering a community from the point of view of its component organisms (genomes), gene-centric analysis considers a community from the viewpoint of its component genes. Genes that are found more frequently in one community than another are assumed to confer a beneficial function on that community (Fig. 2). For example, proteorhodopsins are very common in marine surface waters compared with other habitats, and enzymes that break down cellulose are more common in the termite hindgut than in other habitats. Over-represented genes of unknown function (the majority) can provide foci for research that improves the odds of making biological discoveries. Gene-centric analysis can also be taken up a level to see if over-represented genes are part of higher functional units such as metabolic pathways. For example, genes involved in photosynthesis are generally over-represented in ocean surface waters compared with other habitats. Conversely, however, genes of low relative abundance supplying essential functions will not be detected by this approach.

Are we starting to get a handle on genetic diversity in the environment?
Scarcely: we are still far from measuring the full extent of genetic diversity encoded by microbial life. The Global Ocean Sampling Expedition has generated the largest metagenomic data set so far, comprising 6.12 million predicted proteins from 7.7 million shotgun sequences. The predicted proteins from this data set represented all previously known families of microbial proteins and added 1,700 new ones (each with more than 20 representatives). However, the rate of discovery of new protein families, containing two or more representatives, was linear with the addition of new sequences. This is not surprising in light of recent estimates of microbial diversity obtained using marker genes, such as those encoding ribosomal RNAs, which project the number of microbial species globally to be in the hundreds of millions to billions.

Can metagenomics go beyond functional prediction?
Strictly speaking, no. Metagenomics only looks at the gene sequences that encode proteins or functional RNAs. To go beyond this DNA-based view, expressed RNA transcripts and translated proteins must be obtained from environmental samples and examined directly. This is already happening. RNAs can be reverse-transcribed into DNA and sequenced, and the mass spectra of protein fragments can be determined and identified by reference to their gene sequences. These approaches have given rise to the fields of metatranscriptomics and metaproteomics.

Box 1 | The nuts and bolts of metagenomics
Metagenomics begins with the extraction of genomic DNA from cellular organisms and/or viruses in an environmental sample. For ‘traditional’ dye-terminator sequencing, the DNA is sheared into uniform lengths and inserted into a vector (a known DNA fragment that can be moved between organisms). The vector is then replicated in a bacterial host (typically Escherichia coli), which produces many clones of the genome fragment suitable for sequencing. Current sequencing technologies produce ‘reads’ of fewer than 1,000 bases (Fig. 3), so thousands of reads are required to recover whole genomes. As with single-genome (isolate) projects, a range of sizes of DNA can be cloned.

Initially, metagenomic studies focused on screening libraries of large-insert clones (fosmids and bacterial artificial chromosomes) for clones containing genes of functional or evolutionary interest, and these clones were then fully sequenced to provide contextual data. This is still a useful approach, but such directed sequencing has largely been replaced by ‘shotgun’ sequencing of randomly sampled, anonymous DNA, which is cheaper and has higher throughput.

As with isolate genomics, metagenomic data-processing usually involves the assembly of short, overlapping sequence reads into a consensus sequence, and prediction of which stretches of sequence encode genes. In metagenomics, however, complex communities may not result in any assembly, as no population may have been sampled a sufficient number of times to result in overlapping reads.

The amount of an organism’s genome recovered from an environmental sample depends on how much sequence data is obtained and the relative abundance of the organism in its community. For example, assuming completely random sampling, a population with an average genome size of 3 million base pairs, and comprising 0.1% of a community, would require 3 billion bases of sequence data to obtain a 1× coverage (each base of the genome is represented on average by one read). This is beyond the range of dye-terminator sequencing, but new, highly parallelized sequencing technologies, such as 454–Roche pyrosequencing and Illumina sequencing, may be up to the job (Fig. 3).

A desirable extra step in metagenomics is to identify the owner of each anonymous DNA fragment, a process called binning, or classification. Binning methods are still in their infancy, and usually only longer fragments (5 kilobases or more) can be classified reliably. P.H. & G.W.T.
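The coverage estimate given in Box 1 can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes completely random sampling, as the box does, and treats relative abundance as the fraction of sequenced bases that come from the population of interest. The function names `bases_for_coverage` and `reads_needed`, and the choice of the 700-base average read length listed for dye-terminator sequencing in Figure 3, are assumptions made for this example.

```python
import math


def bases_for_coverage(genome_size_bp, relative_abundance, coverage=1.0):
    """Community sequence (in bases) needed for one population to reach `coverage`.

    Assumes completely random sampling, so a population that makes up a
    fraction `relative_abundance` of the community receives that fraction
    of all sequenced bases.
    """
    if not 0 < relative_abundance <= 1:
        raise ValueError("relative_abundance must be in (0, 1]")
    return coverage * genome_size_bp / relative_abundance


def reads_needed(total_bases, read_length_bp):
    """Number of reads of a given average length needed to supply `total_bases`."""
    return math.ceil(total_bases / read_length_bp)


# Box 1 example: a 3 million base pair genome at 0.1% of the community, 1x coverage.
total = bases_for_coverage(3_000_000, 0.001)
print(f"{total:.3g} bases of sequence data")
print(reads_needed(total, 700), "reads at an average of 700 bases each")
```

With the Box 1 numbers this reproduces the figure of about 3 billion bases, which at 700 bases per read corresponds to several million reads.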

Can metagenomics be used for viruses?
Certainly. Viral communities have been the subject of several metagenomic investigations and were among the earliest to be studied using the method (Fig. 1). These studies, and viral sequences cropping up in metagenomic data in general, all point to a central role for viruses in microbial evolution and ecology. Initially, only double-stranded DNA viruses were accessible through cloning, but the newer cloneless sequencing technologies allow access to all types (such as single-stranded and RNA viruses).

[Figure 2 graphic: communities a and b. Legend: adaptive gene for habitat a; adaptive gene for habitat b; essential gene.]
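The gene-centric counting illustrated in Figure 2 can be sketched in a few lines. This is a toy illustration under stated assumptions, not the method used in any of the studies mentioned: the function names, the colour labels standing in for gene families, and the twofold threshold for calling a family over-represented are all hypothetical, and a real analysis would use a statistical test rather than a fixed ratio.

```python
from collections import Counter


def family_frequencies(gene_families):
    """Relative frequency of each gene family among the genes of one community."""
    counts = Counter(gene_families)
    total = sum(counts.values())
    return {family: n / total for family, n in counts.items()}


def classify_families(community_a, community_b, fold=2.0):
    """Label each family as over-represented in a, in b, or shared.

    `fold` is an arbitrary illustrative threshold: a family is called
    over-represented when its frequency in one community is at least
    `fold` times its frequency in the other.
    """
    freq_a = family_frequencies(community_a)
    freq_b = family_frequencies(community_b)
    labels = {}
    for family in set(freq_a) | set(freq_b):
        fa = freq_a.get(family, 0.0)
        fb = freq_b.get(family, 0.0)
        if fa >= fold * fb:
            labels[family] = "over-represented in a"
        elif fb >= fold * fa:
            labels[family] = "over-represented in b"
        else:
            labels[family] = "shared"
    return labels


# Toy data: each list entry is the family assigned to one predicted gene.
community_a = ["red"] * 6 + ["blue"] * 3 + ["green"] * 1
community_b = ["green"] * 6 + ["blue"] * 3 + ["red"] * 1
for family, label in sorted(classify_families(community_a, community_b).items()):
    print(family, "->", label)
```

Note that the organisms contributing the genes never enter the calculation, which is the point of the gene-centric view. The sketch also shows its blind spot: a family that is rare in both communities is simply labelled shared, however essential its function.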

Figure 2 | Gene-centric analysis. Genes from communities a and b are assigned to their respective gene families (red, green or blue) and counted to highlight functions that are putatively beneficial (red over-represented in community a, and green over-represented in community b), or essential or having a generalized function (blue; high counts in both communities). The organisms that contribute the genes are largely ignored in this approach, and the underlying assumption is that high counts indicate functions that are important for survival in a given habitat. (Graphic by S. Tringe.)

… and for eukaryotes?
Yes. But because of the cost, sequencing projects have tried to avoid them: eukaryotes in general have much larger genomes and a higher proportion of DNA that doesn’t code for protein. Metagenomic analyses of insect–microbe symbioses, for example, have tried to exclude the eukaryotic host DNA. As sequencing costs continue to fall, particularly with the development of higher-throughput technologies, eukaryotes should become a tractable component of a metagenomic analysis.

What will eukaryotic metagenomics tell us?
A common misconception is that animals and plants constitute the bulk of eukaryotic diversity. In fact, most of the evolutionary diversity exists in the under-studied microbial eukaryotes, which are typically unicellular and as difficult to culture as their bacterial and archaeal cousins. For this reason, metagenomics will be an excellent way to study microbial eukaryotic biology (and thereby eukaryotic biology as a whole). However, this has not yet happened because most microbial eukaryotes also have large genomes. Even for plants and animals, a metagenomic approach may be warranted. Basing our understanding of a species on the genome sequence of one or a few individuals misses a lot of genetic information, as is beginning to be appreciated for the human population. In the case of large multicellular eukaryotes such as humans, the equivalent of metagenomics is to sequence the genomes of many individuals.

Will computational capacity keep pace?
That remains to be seen. Sequence data are increasing at a rate higher than increases in computational power. Even more problematic than simple storage of sequence data are the ‘all-versus-all’ sequence comparisons required to best interpret metagenomes, which raise the computational requirements exponentially. Unless there is a radical breakthrough in computing, for example if quantum computers become viable, then all-versus-all comparisons of metagenomic data will not be feasible in the near future. On the upside, it is unnecessary to compare all data with each other to extract biological insights, and the biology can drive the selection of a relevant subset of available metagenomic data.

What other bottlenecks are there?
The gap between characterized and hypothetical proteins identified in metagenomes is widening at an alarming rate. Next to computational resources, uncharacterized gene products are likely to be the biggest bottleneck for the foreseeable future. This means that our understanding of microbial ecosystems will be partial at best, being based on what we can infer from our existing knowledge of biochemistry. Before the advent of metagenomics, however, we were completely oblivious to what we didn’t know (unknown unknowns); now we have the blueprints, although we can’t read many of the instructions (known unknowns).

Who can use metagenomics?
An increasing number of labs: metagenomics is becoming a basic technology for understanding the ecology and evolution of microbial ecosystems, upon which hypotheses and experimental strategies are built. And with new sequencing technologies producing more than 100 megabases of data for less than US$20,000 (Fig. 3), metagenomics is now within the reach of many researchers.

Philip Hugenholtz is in the Microbial Ecology Program, DOE Joint Genome Institute, Walnut Creek, California 94598, USA.
Gene W. Tyson is in the Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA.

FURTHER READING
Tyson, G. W. et al. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428, 37–43 (2004).
Edwards, R. A. & Rohwer, F. Viral metagenomics. Nature Rev. Microbiol. 3, 504–510 (2005).
DeLong, E. F. et al. Community genomics among stratified microbial assemblages in the ocean’s interior. Science 311, 496–503 (2006).
Turnbaugh, P. J. et al. An obesity-associated gut microbiome with increased capacity for energy harvest. Nature 444, 1027–1031 (2006).
Yooseph, S. et al. The Sorcerer II Global Ocean Sampling Expedition: expanding the universe of protein families. PLoS Biol. 5, e16 (2007).
Warnecke, F. et al. Metagenomic and functional analysis of hindgut microbiota of a wood-feeding higher termite. Nature 450, 560–565 (2007).

Platform                                     Million base pairs per run   Cost per base (US¢)   Average read length (base pairs)
Dye-terminator (ABI 3730xl)                  0.07                         0.1                   700
454–Roche pyrosequencing (GS FLX Titanium)   400                          0.003                 400
Illumina sequencing (GAII)                   2,000                        0.0007                35

Figure 3 | Comparison of the cost and throughput of sequencing technologies. New technologies (454–Roche pyrosequencing and Illumina sequencing) generate far more sequence data per run, at a much lower cost than conventional dye-terminator sequencing, but the reads are shorter. Improvements in these technologies are already producing longer reads, and single-molecule sequencing (an as-yet-unproven emerging technology) holds the promise of longer reads still. The dye-terminator approach may soon be obsolete.
