A New Tool for De Novo Extraction of Strains from Metagenomes Christopher Quince1* , Tom O
Total Page:16
File Type:pdf, Size:1020Kb
Quince et al. Genome Biology (2017) 18:181 DOI 10.1186/s13059-017-1309-9 METHOD Open Access DESMAN: a new tool for de novo extraction of strains from metagenomes Christopher Quince1* , Tom O. Delmont2, Sébastien Raguideau1, Johannes Alneberg3, Aaron E. Darling4, Gavin Collins5,6 and A. Murat Eren2,7 Abstract We introduce DESMAN for De novo Extraction of Strains from Metagenomes. Large multi-sample metagenomes are being generated but strain variation results in fragmentary co-assemblies. Current algorithms can bin contigs into metagenome-assembled genomes but are unable to resolve strain-level variation. DESMAN identifies variants in core genes and uses co-occurrence across samples to link variants into haplotypes and abundance profiles. These are then searched for against non-core genes to determine the accessory genome of each strain. We validated DESMAN on a complex 50-species 210-genome 96-sample synthetic mock data set and then applied it to the Tara Oceans microbiome. Keywords: Metagenomes, Strain, Niche Background become uncertain and fragment into multiple contigs [4]. Metagenomics, the direct sequencing of DNA extracted Metagenomes contain many conserved regions between from an environment, offers a unique opportunity to strains. These act effectively as inter-genome repeats, and study whole microbial communities in situ. The major- hence produce highly fragmented assemblies. This is par- ity of contemporary metagenomics studies use shotgun ticularly true when multiple samples are co-assembled sequencing, where DNA is fragmented prior to sequenc- together. It is possible to bin these contigs into parti- ing with short reads, of the order of hundreds of base tions that derive from the same species using sequence pairs (bps). To realise the potential of metagenomics fully, composition [5, 6] and more powerfully, the varying methods capable of resolving both the species and the coverage of individual co-assembled contigs over multi- strains present in this data are needed. Reference-based ple samples [7–10]. However, the resulting genome bins, solutions for strain identification have been developed or metagenome-assembled genomes (MAGs), represent [1, 2] but for the vast majority of microbial species, com- aggregates of multiple similar strains. These strains will prehensive strain-level databases do not exist. This situ- vary both in the precise sequence of shared genes, when ation is unlikely to change, particularly given the great that variation is below the resolution of the assembler, but diversity of microbes that are elusive to standard culti- also in gene complement, because not all genes and hence, vation techniques [3]. This motivates de novo strategies contigs will be present in all strains. capable of resolving novel variation at high resolution Modified experimental approaches can be used to sim- directly from metagenomic data. plify the challenge of metagenomics assembly by reduc- It is not usually possible simply to assemble metage- ing individual sample complexity, for example through nomic reads into individual genomes that provide strain- enrichment cultures that preferentially grow organisms level resolution. This is because in the presence of repeats adapted to particular growth conditions [11] or with (identical regions that exceed the read length), assemblies potentially less bias by selecting small subsets of cells using flow cytometry and sequencing with low-input *Correspondence: [email protected] DNA techniques [12]. The latter has been coupled with 1Warwick Medical School, University of Warwick, Gibbet Hill Road, CV4 7AL the sequencing of a standard whole-community metage- Coventry, UK nomics sample in a novel binning pipeline, MetaSort Full list of author information is available at the end of the article [13], which exploits the assembly graph to map the flow © The Author(s). 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Quince et al. Genome Biology (2017) 18:181 Page 2 of 22 cytometry sample sequences onto those from the commu- which strain. From the analysis of core genes, we know nity metagenome and to extract genomes. However, for how many strains are present and their relative abun- the majority of studies that do not perform enrichment or dances across samples. The signature of relative frequen- flow cytometry, improved bioinformatics algorithms will cies across samples associated with each strain will also be required to resolve strain variation from metagenome be observed on the non-core gene variants but, crucially, data sets. not all strains will possess these genes and potentially A number of methods exist that map reads against ref- they may be in multiple copies. The relative strain fre- erence genes or genomes to resolve strain-level variation quencies have to be adjusted, therefore, to reflect these de novo [14–16]. The most straightforward approach is copy numbers. For instance, if a gene is present in just a to take the consensus single-nucleotide polymorphisms single copy in one strain, it can have no variants. In addi- (SNPs) in individual samples to be the haplotypes [16, 17]. tion, the total coverage associated with a gene will also This cannot, however, resolve mixtures and will entirely depend on which strains possess that gene being a sim- miss strains that are not dominant at least somewhere. ple sum of the individual strain coverages. Here, we do These shortcomings can be addressed by using the fre- not address the multi-copy problem, just gene presence or quency of the variants across multiple samples to resolve absence in a strain. We infer these given the observed vari- de novo strain-level variation and abundances. This is the ant base frequencies and gene coverages across samples approach taken in the Lineage algorithm of O’Brien whilst keeping the strain signatures fixed at those com- et al. [14] and ConStrains [15]. However, no method puted from the SCSGs and SCGs. This also provides a has yet been developed that works from assembled con- strategy for inferring non-core gene haplotypes on strains. tigs, avoiding the need for any reference genomes, and, Taken together, these two steps provide a procedure for hence, is applicable to microbial populations that lack cul- resolving both strain haplotypes on the core genome and tured representatives. Here, we show that it is possible their gene complements entirely de novo from short-read to combine this principle with contig-binning algorithms metagenome data. We recommend applying this strat- and resolve the strain-level variation in MAGs, both in egy to genes, but crucially genes called on the assem- terms of nucleotide variation on core genes and variation bled contigs. If contig assignments are preferred, the in gene complement. same methodology could be applied directly to the con- We denote our strategy DESMAN forDenovoExtrac- tigs themselves, or a consensus assignment of genes tion of Strains from Metagenomes. We assume that a on a contig used to determine its presence or absence co-assembly has been performed and the contigs binned in a given strain. The DESMAN pipeline is summarised into MAGs. Any binning algorithm could be used for this, in Fig. 1. but here we applied CONCOCT [9]. We also assume that The advantage of using base frequencies across samples reads have been mapped back onto these contigs as part to resolve strains, rather than existing haplotype resolu- of this process. To resolve strain variation within a MAG tion algorithms that link variants using reads [18], is that or group of MAGs deriving from a single species, we first it enables us to resolve variation that is less divergent than identify core genes that are present in all strains as a sin- the reciprocal of the read length and to link strains across gle copy. In the absence of any reference genomes, these contigs. The intuition behind frequency-based strain will simply be those genes known to be core for all bacteria inference is similar to that of contig binning. The frequen- and archaea (single-copy core genes or SCGs), e.g. the 36 cies of variants associated with a strain fluctuate across clusters of orthologous groups of proteins (COGs) iden- samples with the abundance of that strain. However, in tified in [9]. If reference genomes from the same species this case it is necessary to consider that multiple strains or related taxa are available, then these can be used to may share the same nucleotide at a given variant position. identify further genes that will satisfy the criteria of being To solve this problem, we develop a full Bayesian model, presentinallstrainsinasinglecopy,inwhichcasewe fitted by a Markov chain Monte Carlo (MCMC) Gibbs denote these as single-copy core species genes (SCSGs). sampler, to learn the strain frequencies, their haplotypes Using the read mappings, we calculate the base frequen- and also sequencing error rates. To improve convergence, cies at each position on the SCSGs or SCGs. Next, we we initialise the Gibbs sampler using non-negative matrix determine variant positions using a likelihood ratio test factorisation (NMF), or more properly non-negative ten- applied to the frequencies of each base summed across sor factorisation (NTF), a method from machine learning samples. We then use the base frequencies across samples that is equivalent to the maximum likelihood solution on these variant positions to resolve the number of strains [19]. Our approach is like the Lineage algorithm devel- present, their abundance and their unique sequence or oped by O’Brien et al.