NATURE|Vol 453|29 May 2008 TECHNOLOGY FEATURE Exploring unseen communities Advances in technology and tools for analysis are allowing researchers to unravel the environmental diversity of microbes faster and in greater detail than ever before. Nathan Blow reports.

This year marks the tenth birthday for meta- all these new data are straining downstream genomics — the cloning and functional analysis. “Computational analysis of meta- analysis of the collective of previ- genomic data still has quite a few outstanding ously unculturable soil microorganisms in questions,” says Isidore Rigoutsos, manager an attempt to reconstruct and characterize of the bioinformatics and pattern-discovery 454 LIFE SCIENCES individual community inhabitants. Since the group at IBM’s Thomas J. Watson Research term was coined by Jo Handelsman and her Center in Yorktown Heights, New York. The colleagues at the University of Wisconsin in assembly and prediction of gene function for Madison, its scope has expanded greatly with high-complexity microbial communities still descriptions of the microbial inhabitants of poses challenges1, for example (see ‘Bench- environments as diverse as the human gut, marks and standards’). the air over New York, the Sargasso Sea and Maybe it is the promise of rapidly improv- honeybee colonies. And within these commu- ing sequencing technology or the new envi- nities researchers are now uncovering a wider ronments being explored, but Peterson says range of microorganisms, thanks in large part that she has seen a growing interest in large to advances in DNA-sequencing technology. metagenomics projects — particularly the “We can look at the metagenomic analysis Human Microbiome Project, which aims to so much more deeply, at such a better cost,” unravel the microbial communities associated says Jane Peterson, associate director of with various parts of the human body, includ- the Division of Extramural Research of the ing the gut (see page 578). “People somehow National Human Research Insti- identify with the Human Microbiome Project. tute in Bethesda, Maryland, which recently It is interesting how this project, especially as launched a five-year initiative to explore the it is studying the gut, has really caught a lot of human microbiome. people’s attention.” Although sequencing technology is creat- The 454 Life Sciences GS FLX sequencing system Over the past few years, the race to sequence ing opportunities for metagenomics research, is used in many metagenomics projects. DNA faster and more cheaply has been taking

BENCHMARKS AND STANDARDS The complexity of microbial the computational tools had growing number of researchers are software and data sets to communities can vary drastically, an increasingly hard time,” using next-generation sequencing benchmark results should help from a couple of microorganisms to says Rigoutsos. For most high- platforms and generating the solve some of the complexity thousands or even millions, making complexity samples, he says, quantity of data that in the past problems associated with the reconstruction of whole the genome assemblers could might only have been possible metagenomics. “If you sequence genomes from some samples not generate larger contigs, at large genome centres. Several sufficiently, even 200 base-pair tricky. “If the community is low in and several contigs that were companies are developing software reads are enough,” says Rigoutsos. complexity, it should allow one to assembled were actually chimaeric to address this issue. But he adds that the real question reconstruct genomes with high mixtures of sequences. CLC bio in Cambridge, is how many 200 base-pair reads accuracy,” says Isidore Rigoutsos, For metagenomic analysis, Massachusetts, offers the CLC will be needed before we can truly manager of the bioinformatics smaller contigs and single reads Genomics Workbench, which understand complex communities. and pattern-discovery group at make assigning the sequence to a provides reference assemblies of Others are finding that with IBM’s Thomas J. Watson Research specific microorganism difficult. data from various next-generation enough reads, fewer than 200 Center in Yorktown Heights, New “We want to be able to assign a read sequencing systems as well as base pairs might be sufficient. Jens York. But when it comes to highly of less than 1,000 ,” mutation detection. A future Stoye from Bielefeld University in complex communities, things are says Rigoutsos, which might allow version of the program will Germany has compared a data set less straightforward. researchers to determine species incorporate algorithms for the de of 35 reads generated on Rigoutsos and his team have composition from high-complexity novo assembly of Sanger as well as the Genome Analyzer from Illumina tested several genome assemblers samples without the need to next-generation sequence data. in San Diego, California, with a and gene-prediction tools on generate larger contigs. Meanwhile, Geospiza in Seattle, 454 data set for the same low- simulated metagenomic data sets Rigoutsos and his colleagues Washington, and GenomeQuest complexity sample. Although 99% with varying degrees of complexity. have made three simulated data in Westborough, Massachusetts, of the Genome Analyzer’s sequence Knowing the composition of the sets available to researchers are developing software to data were discarded, because the community allowed the team to interested in testing assembly and analyse data generated by Applied system generates up to 50 million benchmark and evaluate the tools. prediction programs. Biosystems SOLID next-generation reads he could assign the species in “We found that as the The problem of data analysis is sequencing platform. the sample with the same efficiency complexity increased, many of not restricted to metagenomics — a The combination of assembly from both data sets. N.B.

687 TECHNOLOGY FEATURE METAGENOMICS NATURE|Vol 453|29 May 2008

centre stage. Several next-generation DNA per reaction, relying instead she notes that studies using sequencing systems are now available, boasting on more reads to generate a 16S ribosomal RNA pose a

gigabase outputs for a variety of genetic appli- greater number of base calls. problem for 454 read lengths. K. NELSON cations. But when it comes to sequencing envi- These shorter read lengths are 16S ribosomal RNA is ronmental samples that contain many different suitable for applications such sequenced to give a sense microorganisms in varying amounts, the next- as candidate gene resequenc- of species abundance and generation options have their limitations. ing, transcriptional profiling composition in a commu- “I would say the only next-generation and microRNA discovery. nity without reconstruct- sequencing technology suitable for meta- But the 454 system’s longer ing entire genomes from the genomics at the moment is the 454 system,” read length has attracted inhabitants. It is around 1,500 says Stephan Schuster a biochemist at Pennsyl- metagenomics researchers, base pairs long, which can be vania State University in University Park. and is enticing the community sequenced in two or three Schuster is not alone — almost all metagen- away from Sanger sequencing, Sanger reads. But, according omic studies currently being reported rely on which provides reads of 700– Karen Nelson is one of many to Nelson, with the 454 sys- either 454 technology or conventional Sanger 1,000 base pairs per reaction. investigators working on the tem, researchers will have to sequencing. The main reason is simple: read Forest Rohwer, a micro- Human Microbiome Project. focus on a variable region of length. biologist at San Diego State the gene to identify the spe- University in California, sees the advantages cies or come up with other metrics, instead of Long-term tool of 454 when analysing sequence data from relying on full-length reads. Developed by 454 Life Sciences in Branford, environmental samples. “We know that when Nelson thinks one approach might be to Connecticut, the 454 system relies on an you are below 35 base pairs it is hard to get combine Sanger capillary sequencing with 454, emulsion polymerase chain reaction (PCR) information out, at 100 base pairs you tend so that Sanger methodology could be used to step that is coupled to . Indi- to lose information, but if you get to 200 base sequence the ends of inserts from large-scale vidual fragments of DNA, 300–500 base pairs pairs you start to gain a lot of information,” he cosmid or fosmid libraries, which would act as long, are attached to beads in vitro and ampli- says. Although he also adds that it is not clear scaffolds for the placement of the shorter, but fied with PCR to generate millions of identi- at the moment how much could be gained more numerous, 454 reads. cal copies on each bead. Fragments are then from even longer reads. But the length of 454 runs could soon be sequenced by use of a massively parallel reac- “The information content might be a little less of an issue. “We are getting very close to tion format in 1.6 million wells on a picotitre different because the Sanger reads are longer. entire exons and whole genes with our next plate. Using this system, researchers can gen- At present, 454 is between 200 and 400 base version,” says Michael Egholm, vice-president erate around 250 base pairs of sequence per pairs, depending on who you talk to, so you are of research and development at 454 Life Sci- reaction while performing 400,000 reads in a not likely to cover an open reading frame, but ences, noting that the company plans to intro- single instrument run. you do get a significant amount of data,” says duce a 500-base-pair capacity instrument later Other next-generation sequencing systems Karen Nelson, an investigator at the J. Craig this year. typically generate fewer than 50 base pairs Venter Institute in Rockville, Maryland. But Diverse results The widespread use of next-generation sequencing technology to explore microbial communities has brought something else to light — greater microorganism diversity. Next- generation systems bypass the cloning of DNA fragments before sequencing, a necessary

VISIGEN BIOTECHNOLOGIES step for most Sanger sequencing, and this has resulted in the discovery of new microorgan- isms that previously had been missed because of cloning difficulties. The lack of cloning has also made determining relative numbers of microbes in the community easier. “As there is no cloning, we have very low bias,” says Egholm. Schuster and others have also noted less bias when applying 454 sequencing to their environmental samples. “What is very comforting to see is that a lot of tests have been done by different groups find- ing the same results with 454 sequencing and qPCR,” says Schuster. Egholm and his colleagues are using their sequencing platform for several large-scale metagenomics projects — the biggest of which actually originated by chance. “The biggest metagenomic project on Earth might be our Neanderthal genome project,” says Egholm. They are using 454 to sequence the complete genome of a Neanderthal, which Egholm says they hope to release by the end of the year. But 95–98% of the DNA in the Neanderthal sample VisiGen Biotechnologies is developing a FRET-based approach to single-molecule sequencing. comes from the environment rather than from

688 NATURE|Vol 453|29 May 2008 TECHNOLOGY FEATURE METAGENOMICS a Neanderthal. This means that to get the 1× phosphate. When one of coverage, or roughly 3 billion base pairs, of the the nucleotides is incor- REVEO genome, the team must sequence somewhere porated by the polymer- between 70 billion to 100 billion base pairs of ase into a native strand these environmental samples. of DNA, a unique FRET As Egholm’s team begins to comb through signal is given off, and the those environmental data contaminating pyrophosphate-contain- the Neanderthal samples, he says the initial ing fluorophore is then results have been surprising. “Whether these released leaving the synthe- are recent bacteria, or ones that ate the poor sized strand of DNA ready guy when he died, we cannot be certain yet, but for the next incorporation I can tell you from the first few contigs where event. By imaging the col- we got multi-kilobase lengths — they matched our changes as each incor- nothing in GenBank.” poration occurs, VisiGen hopes to sequence single The next, next generation molecules in real-time and Reveo is developing a single-molecule sequencing technology that “The biggest change in metagenomics will come apply this on a massively physically probes DNA. from ‘third generation’ sequencing systems or parallel array with an out- single-molecule sequencing,” says Schuster. put of up to one million bases per second. These new single-molecule sequencing For the metagenomics community this next- Reveo in Hawthorne, New York, is developing methods show the promise of generating next-generation promises longer reads than a method to sequence DNA using nano-knife- much longer read lengths in the future, says Sanger sequencing, even higher edge probes, which pass over DNA Schuster. VisiGen’s method predicts sequence throughput, lower costs and bet- that has been stretched and immo- reads of 1,500 base pairs — the size of 16S ter quantitation of genes. bilized in a channel 10 microme- ribosomal RNA — whereas the Pacific Bio- VisiGen Biotechnologies in tres wide. By using four different sciences approach could produce reads as long Houston, Texas, is working on nano-knife-edge probes, each as 10 kilobases. a method for sequencing sin- ‘tuned’ to a different frequency, And single-molecule sequencing is becom- gle molecules in real-time that Reveo hopes to non-destructively ing a reality: in March 2008, Helicos Bio- uses Förster resonance energy sequence DNA while eliminating Sciences of Cambridge, Massachusetts, sold transfer (FRET). In this system, the need for a costly imaging com- the first single-molecule sequencing system. the polymerase is engineered to ponent. And Pacific Biosciences of The HeliScope uses a sequencing-by-synthesis contain a donor fluorophore, Menlo Park, California, recently approach in which the DNA is first fragmented and each has a differ- announced its single-molecule into pieces 100–200 base pairs long. Adaptors ently coloured acceptor fluoro- Isidore Rigoutsos is sequencing technology based on of known sequence are attached to the ends of phore attached to the gamma testing genome assembly. zero-mode waveguides2. the fragments so that they can be captured on

THE HUMAN ENVIRONMENT “I don’t know why the Human viewed by many researchers in to isolate bacteria that are assess microbial diversity at the Microbiome Project bubbled the metagenomics community as currently cannot be cultured, various sites. to the top now instead of particularly timely as most agree development of new bioinformatic Peterson says that the protocols previously,” says Jane Peterson improvements in sample collection programs and tools for analysis for sampling and recruitment from the National Institutes of standards and analysis tools are of large genomic data sets, data have provided some early Health (NIH). “Certainly there much needed. Karen Nelson from analysis and coordination centres, challenges. The HMP hopes to have been metagenomic studies the J. Craig Venter Institute, who and analysing and understanding sample the same sites on all 250 done with the old technologies.” is also a participating investigator the ethical, legal and social issues individuals, but with so many But Peterson thinks recent in the HMP, says these issues of the project. sites, standardizing sampling reports exploring the human can finally be addressed with a The HMP’s initial sequencing can be tricky. “The protocol for gut microbial community, along project of this scope as all samples efforts began this year. Peterson the different sites has to be well with new, advanced sequencing will collected and treated in the says that towards the end of worked out,” she says, “the technologies, might have been same way, and standards will be the last fiscal year, the NIH oral community has to be happy enough to pique the interest of put in placefor annotation of the Roadmap office had funds with the requirements that the the reviewers who decided to fund metagenomic data. available for the HMP. Through skin community brings to the the Human Microbiome Project The project will fund research existing relationships with large table.” (HMP). in several areas, although the sequencing groups, the HMP Although it is just at the The project is a 5-year, US$115- construction of a data resource for was able to start quickly and beginning, the HMP is scheduled million effort to study the microbial sequencing DNA samples from generate preliminary sequencing to award its first rounds of grants communities inhabiting several as many as 250 individuals, and data, which are being used for the to researchers this autumn. regions of the human body, projects aimed at demonstrating other demonstration projects. Peterson says that in the future including the gastrointestinal how changes in the human This initial effort will result in the project’s standardization and female urogenital tracts, microbiome are related to health the sequencing of 200 new efforts might not be restricted oral cavity, nasal and pharyngeal and disease are centrepieces of bacterial organisms, recruitment to the United States. “We are tract, and skin, and how those the programme. Other project of patients for metagenomic also forming an international communities influence human initiatives include the development studies, and some 16S ribosomal consortium to coordinate health and disease. The effort is of new metagenomics technology RNA metagenomic sequencing to international projects.” N.B.

689 TECHNOLOGY FEATURE METAGENOMICS NATURE|Vol 453|29 May 2008

the surface of a coated flow cell. Once attached, a mix of labelled nucleotides and polymerase CDC is flowed through the cell and the surface is imaged at different times to determine whether labelled nucleotides have been incorporated. The key to the technology is a method to effectively cleave off the labelled nucleotides following incorporation, permitting additional rounds to be performed. “The moment metagenomics goes down to the single-molecule level, it will be possible to assess even very low abundance messages at a very large sequence interval,” says Schuster.

Information overload “People struggle at the moment with extract- ing basic information from metagenomic data,” says Peer Bork a bioinformatician at the European Molecular Biology Laboratory in Heidelberg, Germany. Bork and others say that most bioinformatic analyses being done on metagenonomic data sets involve relatively basic procedures such as assembly and gene annotation. There are several options for such proce- dures. Some use web-based tools, such as the Enterococcus species are an example of bacteria found in the human gastrointestinal tract that will be Rapid Annotations using Subsystems Tech- examined during the Human Microbiome Project. nology (RAST) server developed by Argonne National Laboratory and the University of Chi- Rigoutsos agrees. “How can you tell the phy- where standards are in place for data acquisi- cago, in which a metagenomics data set is sub- logenetic provenance of a sequence segment tion, generation and analysis,” says Nelson. mitted and an analysis file is returned. Others, if the databases contain no examples of it?” such as the Metagenome Analyzer (MEGAN) he asks. He sees the situation as being akin to Dynamic future program, developed by Schuster and his col- the early days of human genomics when some With bioinformatics tools and sequencing leagues, run on a desktop computer. gene-prediction programs relied on database poised to go even faster at a lower cost, “I think being able to handle these large data searches to draw conclusions. His group has researchers are eyeing the next level of sets and developing tools for visualization are developed PhlyoPythia, a software tool that metagenomic analysis — a move from simply critical, as they will allow researchers to write also takes a binning approach to classifying cataloguing the microbes in a community to meaningful publications,” says Schuster. He sequence contigs assembled from metagen- understanding the interactions and dynamics explains that MEGAN uses a binning format omic data sets. of the organisms. based on the National Center for Biotechnology Annotation is not the only standardiza- “I think a systems-level approach that gives Information’s taxonomy tion issue. “We need to a holistic view of these communities will open database to display how do a better job of collect- different research possibilities,” says Rigout-

F. ROHWER F. often a certain taxon occurs. ing information when sos, whose group is now working on studying Other efforts, including we take samples,” says metagenomes for biofuel applications. the Community Cyber- Rohwer. The time, place Rohwer hopes to use metagenomics to infrastructure for Advanced and collection method manipulate a system and then trace how the Marine Microbial Ecology can profoundly affect the community goes through changes at a global Research and Analysis, or microbial composition in a level. He thinks that with sequencing speed and CAMERA, aim to bring sample. Whether the sam- cost declining so rapidly, these experiments, together bioinformatics ple is collected during the which only a few years ago would have been resources and a data reposi- day or at night, acquired impossible, are now within reach. tory to assist with analysis of from a person’s right or Bork also wants to move towards analysing the data sets. Forest Rohwer studies phage the left arm, or even if two global communities. “Most groups concentrate One issue with the vari- distribution in environmental samples. soil samples are collected on the early parts, but I think it is time to ask ous tools available, notes 5 millimetres apart, these the bioinformatics people to develop methods Rohwer, is that none has been adopted as stand- seemingly small differences can lead to very beyond there, to get ecology concepts projected ard. “It is still pretty much a cottage industry,” different communities of organisms. Rohwer on those molecular data.” he says. thinks that collecting a standard set of infor- Nelson says that all the technology develop- “Standardization of annotations and gene- mation for each sample would make future ment and interest in metagenomics is spurring prediction quality during the early stages of comparisons between different data sets easier the field along nicely. “I think the technology is metagenomic studies is something that needs to and so provide greater biological insight. very promising and it will get better. It is a great be addressed,” agrees Nelson. She adds that it is There is hope that some of these issues will time to be doing this.” ■ now much easier to generate data than to inter- be addressed as members of the metagenomics Nathan Blow is the technology editor for pret what they mean. “In a number of situations community become increasingly involved with Nature and Nature Methods. we are dealing with unknown species that have large-scale, multi-institution projects (see ‘The 1. Mavromatis, K. et al. Nature Methods 4, 495–500 (2007). not had their genomes sequenced, so there will human environment’). “The Human Microbi- 2. Korlach, J. et al. Proc. Natl Acad. Sci. USA 105, 1176–1181 not be a reference genome to align to.” ome Project is going to be one good example (2008).

690