Metagenomic Binning Through Low-Density Hashing
Total Page:16
File Type:pdf, Size:1020Kb
Metagenomic binning through low-density hashing The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation Lou, Yunan, et al. "Metagenomic binning through low-density hashing." Bioinformatics 35, 2, (January 2019): 219–226 © 2018 The Author(s) As Published http://dx.doi.org/10.1093/bioinformatics/bty611 Publisher Oxford University Press (OUP) Version Final published version Citable link https://hdl.handle.net/1721.1/122806 Terms of Use Creative Commons Attribution NonCommercial License 4.0 Detailed Terms https://creativecommons.org/licenses/by-nc/4.0/ Bioinformatics, 35(2), 2019, 219–226 doi: 10.1093/bioinformatics/bty611 Advance Access Publication Date: 13 July 2018 Original Paper Genome analysis Metagenomic binning through low-density Downloaded from https://academic.oup.com/bioinformatics/article-abstract/35/2/219/5053315 by MIT Libraries user on 07 November 2019 hashing Yunan Luo1,†, Yun William Yu2,4,†, Jianyang Zeng3, Bonnie Berger4,* and Jian Peng1,* 1Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, IL 61801, USA, 2Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA, 3Machine Learning and Computational Biology Group, Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing 100084, China and 4Department of Mathematics and Computer Science and AI Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA *To whom correspondence should be addressed. †The authors wish it to be known that, in their opinion, the first two authors should be regarded as Joint First Authors. Associate Editor: Inanc Birol Received on April 8, 2018; revised on June 18, 2018; editorial decision on July 4, 2018; accepted on July 10, 2018 Abstract Motivation: Vastly greater quantities of microbial genome data are being generated where envir- onmental samples mix together the DNA from many different species. Here, we present Opal for metagenomic binning, the task of identifying the origin species of DNA sequencing reads. We introduce ‘low-density’ locality sensitive hashing to bioinformatics, with the addition of Gallager codes for even coverage, enabling quick and accurate metagenomic binning. Results: On public benchmarks, Opal halves the error on precision/recall (F1-score) as compared with both alignment-based and alignment-free methods for species classification. We demonstrate even more marked improvement at higher taxonomic levels, allowing for the discovery of novel lineages. Furthermore, the innovation of low-density, even-coverage hashing should itself prove an essential methodological advance as it enables the application of machine learning to other bio- informatic challenges. Availability and implementation: Full source code and datasets are available at http://opal.csail. mit.edu and https://github.com/yunwilliamyu/opal. Contact: [email protected] or [email protected] Supplementary information: Supplementary data are available at Bioinformatics online. 1 Introduction these questions are vastly more complex than standard genomic datasets due to the mixing of unknown amounts of different Metagenomics is the study of the microbiome—the many genomes genomes present. These data result in major identification challenges (bacterial, fungal and even viral) that make up a particular environ- for certain bacterial, as well as viral, species, strains and genera ment. The microbiome has been linked to human health: soil sam- (Janda and Abbott, 2007; Tu et al., 2014). ples can lead to the discovery of new antibiotics (Forsberg et al., We focus on whole-genome metagenomic DNA sequencing, 2012), and the human gut microbiome has been linked to Crohn’s since cheaper Amplicon-based sequencing methods, which concen- Disease (Erickson et al., 2012), obesity (Turnbaugh and Gordon, trate on the diversity of given marker genes (e.g. the 16S rRNA 2009) and even autism spectrum disorder (MacFabe, 2012). gene) and only analyze protein-coding regions, are limited in their Metagenomics fundamentally asks what organisms are present in a ability to provide microbial functions from the samples (1000 genomic sample with the goal of gaining insight into function. Genomes Project Consortium, 2012; Altschul et al., 1990; Berlin However, the sequencing datasets required to shine any light on et al., 2015). Unfortunately, the mixing of DNA from many VC The Author(s) 2018. Published by Oxford University Press. 219 This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected] 220 Y.Luo et al. different, sometimes related organisms in varying quantities poses et al., 2015; Keich et al., 2004; Ma et al., 2002). Spaced-seeds are an substantial computational and statistical challenges to metagenomic example of LSH, but are generally high-density because they have binning, the process of grouping reads and assigning them to an ori- historically been designed to optimize for the probability of at least gin organism. This important first step occurs before downstream one hit over the expected number of hits in a pair of homologous data analysis can be applied to elucidate the structure of microbial but non-identical regions. The original PatternHunter, which intro- populations and assign functional annotations (1000 Genomes duced spaced-seeds to bioinformatics used 11 locations in a 19-mer Project Consortium, 2012; Bromberg and Rost, 2007). Existing se- (Keich et al., 2004; Ma et al., 2002). Brinda et al. (2015) used quence alignment tools, such as BWA-MEM (Li, 2013), Bowtie 2 spaced-seeds in Kraken for metagenomic binning, but optimized for (Langmead and Salzberg, 2012) or BLAST (Altschul et al., 1990), hit count and coverage, using e.g. 16 locations in a 28-mer, or 24 Downloaded from https://academic.oup.com/bioinformatics/article-abstract/35/2/219/5053315 by MIT Libraries user on 07 November 2019 can readily be used and usually provide high-resolution alignments locations in a 36-mer. Employing a ‘high density’ of over half the and accurate results by simply finding the highest scoring matching locations in a k-mer can decrease spurious matches, necessary when genome; they have the added advantage of tolerance to small num- using a discriminative k-mer framework like that of Kraken, which bers of mismatches or gaps. However, the computational cost of can only distinguish between two species when a k-mer in the read alignment-based methods becomes prohibitive as metagenomic matches one species but not another. However, this approach still datasets continue to grow (Wood and Salzberg, 2014; Yu et al., runs into the same exponential space problem of large k-mer sizes 2015a). Similarly, contig-assembly based methods, while effective, when used in a ML framework because most of the positions in the are also computationally expensive (Alneberg et al., 2014). read still contribute to the dimensionality (Section 2). Thus, using Alternatively, the field has turned to alignment-free metagenomic lower density hashing is necessary, but it comes at the cost of not binning (also known as compositional binning) (Nawy, 2015), which capturing long discriminative k-mers. Fortunately, discriminative k- assigns sequence fragments to their taxonomic origins according to mers are not needed in SVM-based models such as the one we use, specific patterns of their constituent k-mers. State-of-the-art tools such as well as other ML-based models, so even using lengths as short as as Kraken (Wood and Salzberg, 2014)andCLARK(Ounit et al., 6–12 positions can be sufficient. Thus, low-density hashing is 2015) use exact occurrences of uniquely discriminating k-mers in enabled by SVM-based models, and furthermore, allows those mod- reads (i.e. k-mers that appear in species A but not species B) and are els to access longer-range correlations that cannot be captured using very efficient, but as a result are limited in both their sensitivity and contiguous short k-mers. To the best of our knowledge, low-density ability to detect unknown organisms. Alternately, Kaiju (Menzel hashing—using significantly fewer than half the locations in a k-mer et al.,2016) uses matches of reads to a protein database instead of ref- (e.g. 25% by default in the method we introduce)—has not previ- erence genomes for the same purpose. Other approaches rely on ously been used for bioinformatic analysis. supervised machine learning (ML) classifiers, such as Naive Bayes or Here we newly overcome these bottlenecks in handling long k- support vector machines (SVMs), trained on a set of reference genome mers by developing a novel compositional metagenomic binning algo- sequences to classify the origins of metagenomic fragments (Wang rithm, Opal, which efficiently encodes long k-mers using low- et al.,2007; Patil et al.,2011; Vervier et al.,2016; Brady and dimensional profiles generated using even-coverage, low-density hash- Salzberg, 2009) using the relative k-mer frequency vector of a read. ing. We take inspiration from low-density parity check (LDPC) error More recently, latent strain analysis performs covariance analysis of correcting codes (also known as Gallager codes)—not previously used k-mers to partition reads for low-abundance strain assembly and de- in bioinformatics—to