
TECHNOLOGY FEATURE

Microbiology: the road to strain-level identification

Vivien Marx

Tools are emerging to help labs trawl for sequences that reveal microbial strains and their functional potential in deep pools of metagenomic data.

‘Who is there’ and ‘what are they doing’ are typical questions in military intelligence and in metagenomics. Both fields sift through big data for signals. Speed is crucial when averting terrorist threats or hunting the cause of food contamination or an infectious disease outbreak. In calmer times, both communities want to keep learning about their subjects of interest. Both communities also benefit from sophisticated tools. Metagenomics researchers analyze genetic information from microbes in different environments, including the human body, using high-throughput sequencing and computational methods.

In metagenomics, researchers might analyze sequencing reads to find out how the gut microbiome differs between individuals with and without Crohn’s disease; others capture the teeming bacterial diversity in samples of soil removed at different times from the same location or in ocean water samples taken at varying time points or depths. Analyses reveal slews of invisible microbial species. But addressing the question ‘who is there’ calls for more than species-level identification: researchers want to identify microbes on the strain level.

Some, but not all, microbial strains spell trouble. US officials recently determined that Shiga-toxin-producing Escherichia coli (STEC O26) caused food poisoning in customers of a particular restaurant chain. In the summer of 2011, the food-borne pathogen enterohemorrhagic E. coli (EHEC) O104:H4 led to illness and deaths in Germany.

Different strains of one species can appear alike, but different strains can actually differ markedly from one another and have different functions. (Image: M. Rohde, Helmholtz Center for Infection Research)

Successful strain-level identification is part of a larger metagenomics trend. Over the past five to ten years, scientists have continuously improved the resolution of their data analysis, says Nicola Segata, a metagenomics methods developer at the University of Trento, Italy. Scientists kept moving down the taxonomic ranks as they distinguished phyla, families, genera and species. The discovery of microbial diversity below the species level led them further still. “Differences in different strains of the same species are crucial for a lot of tasks,” says Segata.

It has been difficult to extract strain-level insight from short-read sequence data. Now it’s becoming more routinely possible to extract not just species but also strains from the metagenome data contained in multiple samples, says Christopher Quince, a microbiome researcher and methods developer at the University of Warwick Medical School.

Within the same species, microbes can differ genetically and perform quite different functions. Such differences might help to explain the emerging contradictions in the burgeoning microbiome literature. Different labs find varying types of association of microbial phyla with certain diseases, says Alice McHardy, a computational biologist at the Helmholtz Center for Infection Research in Braunschweig, Germany. Strain-level analysis may help resolve these seeming contradictions.

One method, the ‘bag of genes’ approach, looks at the functions encoded by all genes in the entire metagenome sample. In contrast, strain-level reconstructions shed light on which genomic properties, such as particular gene functions or even single-nucleotide polymorphisms, are distinctive for particular microbial […] and phenotypes, says McHardy. Such reconstructions help scientists explore metabolic changes in strains over time or under varying conditions. Strain-level information could contribute to research about pathogen–host interaction, she says, which can help with treatment decisions and improve understanding of a pathogen’s […]. Analysis can reveal microevolution aspects and phenomena such as horizontal gene transfer or mutational hot spots.

Analysis of both DNA and RNA gives a fuller view of a microbial community, says Nicola Segata. (Image: Alessio Coser, University of Trento)

© 2016 Nature America, Inc. All rights reserved.

The field’s progress has led to microbiome-focused biotech ventures, and larger companies are taking notice. What he finds exciting, says Dirk Gevers, who directs the Janssen Human Microbiome Institute

NATURE METHODS | VOL.13 NO.5 | MAY 2016

(JHMI), is that new types of microbiome-based products could be more natural interventions; they might be used just as disease develops and could potentially deliver more complex and subtle signals to the host.

The JHMI, which is part of Janssen Research and Development, one of the Janssen Pharmaceutical Companies of Johnson & Johnson, is pursuing collaborations with entrepreneurs and with academic and clinical labs as it ramps up microbiome-specific in-house expertise. Gevers was previously at the Broad Institute of Harvard and MIT and worked on the National Institutes of Health Human Microbiome Project, a large-scale project devoted to characterizing the human microbiome’s role in health and disease (ref. 1). Products based on microbes cannot be an indiscriminate mix; they will contain specific strains selected for desired qualities and functions, says Gevers. Strain-level insight is only slowly emerging; new analytical technologies are starting to change that.

Who’s there?

Amplicon sequencing has long helped researchers analyze samples both of cultivable microbes and of those that fail to grow in the lab. The technique leverages small differences in a distinctive taxonomic marker: the small-subunit ribosomal RNA (16S) gene locus that is highly conserved across microbes. 16S rRNA sequencing remains a “powerful and cost-effective tool” for many experiments, says Segata, yet it is not as useful for studying aspects in which strain-level differences matter. In his view, and especially for microbes that cannot be cultivated in the lab, shotgun metagenomics is “the only way to go for strain-level profiling.”

Shotgun metagenomic sequencing lets scientists explore ‘who is there’. A sample’s jumble of microbial DNA is sequenced and analyzed to shed light on taxonomic identities. A sample from a person’s gut inevitably includes bacterial, fungal, archaeal, viral and human DNA, says Segata. Add this to the overall microbial diversity and it’s a seemingly intractable analytical challenge. Quince offers a thought experiment to describe how labs assign sequence reads to the strains from which they came (see Box 1, ‘What the monk saw’).

Computational tools, sequencing technology with longer reads, and lower error rates will let scientists move toward comprehensively characterizing all microbes in samples at the strain level from shotgun metagenomes, says McHardy.

What motivates the community is that strain-level identification connects sequencing studies based on human clinical samples to functional testing in animal models, says Julie Segre, an investigator at the National Human Genome Research Institute. She is also part of the Human Microbiome Project, now in its second phase. “You can’t take a bunch of base pairs of DNA from a sequencer and use them to colonize a mouse,” says Segre. That would not successfully create a

BOX 1 WHAT THE MONK SAW: A METAGENOMIC TALE

Dramatis personae: libraries: microbial samples; book: a microbial species; book versions: strains; words and letters: genomic data; reconstructions: contiguous genomic segments (contigs); shared colors: metagenomic binning, which uses shared frequencies across samples in order to assign contigs to species or strains.

A bibliophile medieval king wants to survey Europe’s libraries by copying all the books. Some libraries hold many copies of one title, such as the Bible or Aesop’s Fables. The king seeks out a monk who is a renowned, fast copyist, albeit a little unfocused.

The monk travels far and wide to monastery libraries. He randomly pulls out a book and copies 50 words onto a parchment snippet. He repeats this task a million times, sometimes with the same, sometimes with a different book. He puts the snippets in a sack and travels onward. Over time, he enlists other copyists to help with the copying tasks.

The monk returns home, empties the sacks and sits amid mountains of paper snippets. He matches them up, finds overlap and reconstructs a few longer texts. But he is stymied by identical book sections. He can’t tell which book they came from when the copied text on the paper pieces is shorter than the longest, repeated book text segments. There is seemingly no way to join the many partial reconstructions. But he has a plan.

In each library he used parchment of a different hue. When he counts how many times each reconstruction appears in each color, he notices that some reconstructions share a color pattern. He surmises that reconstructions with the same color pattern all came from the same book, as each book has a unique set of frequencies across libraries. He is able to link the reconstructions to yield the book contents, even though the order of the partial reconstructions is still jumbled.

As the years pass, the monk realizes the copyists made small errors, even with the Bible. Words were changed, new passages were added, perhaps deliberately. Intrigued, he sets out to find all modified books across Europe. Again, color patterns save the day: he can link changed words together by following the pattern of colored fragments associated with each change. This approach allows him to reconstruct all the variants of an existing book. The monk, now gray but limber, asks to see the king, who is overjoyed to hear the survey is complete. The king holds a week-long feast in the monk’s honor.

Source: Christopher Quince, University of Warwick Medical School.
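The monk's color trick in Box 1 can be illustrated with a small Python sketch. It assumes each contig already has a coverage value per sample (the "count per parchment color") and groups contigs whose coverage patterns across samples point in nearly the same direction. The function names, the cosine-similarity rule, the 0.99 threshold and the toy numbers are all illustrative assumptions; this is not the algorithm used by CONCOCT or any other published binner, which rely on more elaborate statistical models.

```python
from math import sqrt


def normalize(coverage):
    """Scale a per-sample coverage vector to unit length so only its shape matters."""
    norm = sqrt(sum(c * c for c in coverage))
    return [c / norm for c in coverage] if norm else list(coverage)


def cosine(u, v):
    """Cosine similarity of two unit-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def bin_contigs(coverages, min_similarity=0.99):
    """Greedily group contigs whose cross-sample coverage patterns match.

    coverages: dict mapping contig name -> list of coverage values, one per sample.
    Returns a list of bins, each a list of contig names.
    """
    bins = []  # each entry: (unit vector of the first contig in the bin, [contig names])
    for name, cov in coverages.items():
        profile = normalize(cov)
        for ref, members in bins:
            if cosine(ref, profile) >= min_similarity:
                members.append(name)
                break
        else:
            # No existing bin has a matching pattern, so start a new one.
            bins.append((profile, [name]))
    return [members for _, members in bins]


# Hypothetical toy data: coverage of five contigs in three samples ("libraries").
toy = {
    "contig_1": [30.0, 5.0, 10.0],
    "contig_2": [61.0, 9.0, 21.0],  # same pattern as contig_1, roughly doubled
    "contig_3": [2.0, 40.0, 8.0],
    "contig_4": [1.0, 19.0, 4.0],   # same pattern as contig_3, roughly halved
    "contig_5": [10.0, 10.0, 10.0],
}
bins = bin_contigs(toy)
print(bins)
```

In this toy case, contigs 1 and 2 end up together, as do contigs 3 and 4, because their abundance rises and falls in step across samples even though their absolute coverage differs. The more samples are available, the more distinctive each pattern becomes, which mirrors the monk's reliance on many libraries.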


BOX 2 COMPETITIVE BUT SENSITIVE: TOOL BENCHMARKING IN METAGENOMICS

Alice McHardy from the Helmholtz Center for Infection Research, Alexander Sczyrba at the University of Bielefeld and Thomas Rattei at the University of Vienna launched the Critical Assessment of Metagenome Interpretation (CAMI) competition, and evaluation of the first CAMI challenge is under way. One CAMI category assesses tools that handle assembly, and another assesses taxonomic profiling. A third category assesses binning tools, which assign individual sequence reads into bins that ideally hold strain-level information but might contain other taxonomic categories.

Evaluations are sensitive business. Scientists all too often discover that another lab evaluated their tool in an unfavorable way, says McHardy. They might have used the simplest settings or ones likely to deliver a suboptimal performance. With competitions, developers can submit their tools and explain how they like them to be used. The community can help decide how performance should be assessed.

The team reached out to labs that chose not to participate and inquired about which tool settings to use in the competition. “This was really well received by some,” she says.

When CAMI results are published, McHardy knows some labs might be disappointed by their software’s performance. The plan is to highlight the pros and cons of tools for different questions and scenarios, she says. Her lab has two tools in the competition, and because analysis is anonymized, she does not know how either is faring. “We will see how all of this will work out; in a few months we will know more,” she says.

She hopes that developers will continue to integrate their tools into the CAMI platform’s so-called docker containers. This integration could help to semiautomatically benchmark these software tools in the future. In conversations with colleagues who run the Critical Assessment of Protein Structure Prediction (CASP) she heard that such regular benchmarking accelerated the protein prediction field. “It would be great if we can make that happen for meta’omics as well,” she says.
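To make the idea of scoring a binning tool concrete, here is a minimal Python sketch of one way a benchmark could grade bins against a known answer. It assumes a simulated data set in which the true source genome of every contig is known, and it reports two simple measures per bin: purity (the share of the bin that comes from its dominant genome) and completeness (the share of that genome the bin recovered). The function name, the measures and the toy labels are assumptions chosen for illustration; the box does not specify which metrics CAMI uses.

```python
from collections import Counter


def score_bins(predicted, truth):
    """Score each predicted bin against known contig origins.

    predicted: dict mapping contig name -> bin label assigned by a tool.
    truth: dict mapping contig name -> true source genome (strain) label.
    Returns a dict mapping bin label -> (majority genome, purity, completeness).
    Counts are per contig here; a real benchmark might weight by contig length.
    """
    genome_sizes = Counter(truth.values())  # contigs per true genome
    by_bin = {}
    for contig, bin_label in predicted.items():
        by_bin.setdefault(bin_label, []).append(truth[contig])

    scores = {}
    for bin_label, genomes in by_bin.items():
        majority, hits = Counter(genomes).most_common(1)[0]
        purity = hits / len(genomes)
        completeness = hits / genome_sizes[majority]
        scores[bin_label] = (majority, purity, completeness)
    return scores


# Hypothetical example: six contigs from two strains, split into two bins.
truth = {
    "c1": "strain_A", "c2": "strain_A", "c3": "strain_A", "c4": "strain_A",
    "c5": "strain_B", "c6": "strain_B",
}
predicted = {
    "c1": "bin_1", "c2": "bin_1", "c3": "bin_1",
    "c4": "bin_2", "c5": "bin_2", "c6": "bin_2",
}
scores = score_bins(predicted, truth)
for label, (genome, purity, completeness) in scores.items():
    print(label, genome, round(purity, 2), round(completeness, 2))
```

Here bin_1 is pure but incomplete, while bin_2 is complete for strain_B but contaminated with a strain_A contig. Reporting both numbers side by side is one way to show the pros and cons of a tool rather than a single ranking.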

model of a bacterial infection by a specific strain. Strain-level identification, she says, connects scientists to clinical […] and to up-to-date medical knowledge.

Strain-level identification from shotgun data offers many opportunities to explore the functional capacity of strains, the ‘what are they doing’ question. Researchers want to learn which genes underlie traits of interest, says Segre.

Quince is happy about the emerging strain-level analysis options and says that “it has practically made single-cell sequencing redundant.” His software tool CONCOCT takes a pile of short reads, assembles them into fragments and can assign the fragments to the organism from which the genetic information came (ref. 2). Other tools work similarly and, he says, “it is the principle that is more important than any one algorithm.” He has extended this software in De Novo Extraction of Strains from Metagenomes, or DESMAN, which infers strains and their abundances using patterns of nucleotide frequencies. The software then determines the nonshared contigs that are unique to individual strains. Such tools help researchers to begin to extract biologically relevant information directly from shotgun metagenome reads. He points to a recent study by Jillian Banfield at the University of California, Berkeley, and her team, as one that shows the strength of combining metagenomic and functional analysis (ref. 3). For example, the group reports finding “unusual biology” in bacteria, including an unusual ribosomal structure and a lack of ribosomal proteins that had been considered universally present in bacteria.

Compute this

Mining metagenomes is computationally demanding, but solutions have emerged and more are coming, says Segata. As it stands, software tools can deliver “noisy,” partially incomplete results. At the same time, he says, “it is just fantastic that this is possible at all.”

During his postdoctoral fellowship in the lab of Curtis Huttenhower at the Harvard School of Public Health, Segata developed MetaPhlAn and MetaPhlAn2, which leverage a genetic signature to identify and track species, and in many cases strains. In practice, that can mean following around 200 genes per species, and a sample might contain thousands of species, such that the tools dig into around one million genes. But, he says, “it’s like distinguishing lions and tigers from their footprints: it works, it’s great, but it does not tell you much about the biology of lions and tigers.”

More recently, he and his Trento team, along with colleagues at the University of Massachusetts Medical School and the Perinatal Institute in Ohio, have developed Pangenome-based Phylogenomic Analysis (PanPhlAn) (ref. 4), which he finds more powerful given that it reveals the full gene repertoire of strains.

The pangenome is a catalog of genes for all sequenced strains in a species of interest, says Segata. It’s built by extracting genes from available reference genomes and then binning them in gene-family clusters. This is how the team calculated pangenomes for over 400 species. PanPhlAn characterizes strains by identifying which combinations of genes, from the pangenome of each species, are found in a sample.

Segata’s team tested the tool and discovered, for example, that some tested Chinese individuals harbor a set of Eubacterium rectale strains that are genomically distinct from the same species found in the gut of tested people in Europe and the United States. Segata believes many more such patterns can be found in metagenomic data, including some that correlate with disease risk.

Using PanPhlAn, researchers can also obtain transcriptional profiles of a microbial community. Knowing which genes are present in a sample speaks only to potential organismal functions; a bacterium with a virulence gene that is not expressed might not be a troublemaker. PanPhlAn, with its analysis of DNA and RNA, gives a fuller view of a microbial community, says Segata.

A computer scientist who gravitated toward biology, Segata is at his university’s Center for Integrative Biology. One-third of his team does wet-lab work, and the others are computational scientists working on biological problems. He integrates the group for discussions and tool testing.

In the Segata group, one-third of the team does wet-lab work and the others are computational scientists focusing on biology. (Image: Alessio Coser, University of Trento/GIANTmicrobes; www.giantmicrobes.com)

Segata hopes that a different kind of integration, tool integration, will begin in the metagenomics community. Integration means, for example, having on hand data standards for tool input and output. But such standards are hard to establish in a rapidly changing field, and, he says, grant funders sometimes do not view tool-integration projects as favorably as projects about new tools.

Another community issue is tool benchmarking. McHardy says she and many other labs spend much time on performance benchmarking. She began her metagenomics journey developing tools such as PhyloPythia, which is a supervised taxonomic binner. Software tools in this class have exploded, she says, and the developer community has grown, too. She counts as many as one new metagenomic analysis tool published per month in 2015. The situation can be confusing when labs use different benchmark settings or data sets that might seem artificial. Along with colleagues, she created a community approach to tool benchmarking (see Box 2, ‘Competitive but sensitive’). The hope is that analysis results will also be representative of real-life questions, she says. “I think being a tool developer in metagenomics can become fun again, and more productive,” she says.

Being a tool developer in metagenomics can become fun again and more productive, says Alice McHardy. (Image: Helmholtz Center for Infection Research)

Along the way to better tools, the community has some fundamentals to discuss, such as the definition of a strain, says Segata. For example, two organisms might differ on average in one nucleotide every 50,000 bases, and researchers will disagree on whether these two organisms are different strains. The number of such differences is likely to vary depending on the species. There is no threshold with biological meaning, he says, that defines the limit between ‘same strain’ and ‘different strain’.

A long view

High-resolution analysis in metagenomics is a tall order, but strain-level insights are promising, says Gevers. This view can show organisms to be stable residents of an environment for several years or longer. Within a habitat, different species are competing, and so are different strains of the same species. The unit of microbial action is a strain, not a species, he says. Being able to differentiate between strains matters greatly when scientists work on ways to intervene with a microbial community by introducing new strains, he says.

The next big challenge is understanding the dynamics of an ecosystem, such as the bustling microbiome in the human gut, and doing so at high resolution, he says. Dissecting a community into its different strains, understanding the differences in genetic content and knowing which strains are expressing which genes or producing which metabolites, secreting which peptides or proteins, “will be eye opening,” says Gevers. Spatial distribution might also matter, he says. Are strains one big mix, or, he asks, is there a “method to the madness”?

Understanding an ecosystem, such as the human gut’s bustling microbiome, at high resolution is the next big challenge, says Dirk Gevers. (Image: Janssen)

Emerging techniques reveal the many different strains of a particular species in a sample, strain stability across different samples and, ultimately, differences in genetic or functional composition within a species across samples. Gevers looks forward to developments that link complete genetic content to each strain and measure each strain’s activity. He is hopeful about progress beyond capturing the genetic composition of a microbial community, bringing strain-level analysis and resolution to RNA, metabolite and protein levels.

Segata and many of his colleagues hope sequencing costs will continue to drop, allowing labs to improve the sequencing depth with which they probe […]. Matters get complicated when there are closely related strains in a sample, or when strains are present in amounts too low to detect or below the level of sequence coverage needed for their reconstruction. Such challenges motivate him and others to keep refining computational methods.

1. The Human Microbiome Project Consortium. Nature 486, 207–214 (2012).
2. Alneberg, J. et al. Nat. Methods 11, 1144–1146 (2014).
3. Brown, C. et al. Nature 523, 208–211 (2015).
4. Scholz, M. et al. Nat. Methods 13, 435–438 (2016).

Vivien Marx is technology editor for Nature Methods ([email protected]).
