Rapid Phylogenetic and Functional Classification of Short Genomic
Total Page:16
File Type:pdf, Size:1020Kb
Berendzen et al. BMC Research Notes 2012, 5:460 http://www.biomedcentral.com/1756-0500/5/460 RESEARCH ARTICLE Open Access Rapid phylogenetic and functional classification of short genomic fragments with signature peptides Joel Berendzen1, William J Bruno2, Judith D Cohn3, Nicolas W Hengartner3, Cheryl R Kuske4, Benjamin H McMahon2*, Murray A Wolinsky4 and Gary Xie4 Abstract Background: Classification is difficult for shotgun metagenomics data from environments such as soils, where the diversity of sequences is high and where reference sequences from close relatives may not exist. Approaches based on sequence-similarity scores must deal with the confounding effects that inheritance and functional pressures exert on the relation between scores and phylogenetic distance, while approaches based on sequence alignment and tree-building are typically limited to a small fraction of gene families. We describe an approach based on finding one or more exact matches between a read and a precomputed set of peptide 10-mers. Results: At even the largest phylogenetic distances, thousands of 10-mer peptide exact matches can be found between pairs of bacterial genomes. Genes that share one or more peptide 10-mers typically have high reciprocal BLAST scores. Among a set of 403 representative bacterial genomes, some 20 million 10-mer peptides were found to be shared. We assign each of these peptides as a signature of a particular node in a phylogenetic reference tree based on the RNA polymerase genes. We classify the phylogeny of a genomic fragment (e.g., read) at the most specific node on the reference tree that is consistent with the phylogeny of observed signature peptides it contains. Using both synthetic data from four newly-sequenced soil-bacterium genomes and ten real soil metagenomics data sets, we demonstrate a sensitivity and specificity comparable to that of the MEGAN metagenomics analysis package using BLASTX against the NR database. Phylogenetic and functional similarity metrics applied to real metagenomics data indicates a signal-to-noise ratio of approximately 400 for distinguishing among environments. Our method assigns ~6.6 Gbp/hr on a single CPU, compared with 25 kbp/hr for methods based on BLASTX against the NR database. Conclusions: Classification by exact matching against a precomputed list of signature peptides provides comparable results to existing techniques for reads longer than about 300 bp and does not degrade severely with shorter reads. Orders of magnitude faster than existing methods, the approach is suitable now for inclusion in analysis pipelines and appears to be extensible in several different directions. Background of sequences is high [1,2] and where sequences from As of this writing, DNA sequencers routinely produce close relatives are not to be found in reference databases more than 2 Gbp of data per hour, with the high-quality (e.g. [3]). Insight into microbial communities and their region of reads as short as 75 bp. Analytical methods dynamics would be desirable for a number of import- that can keep up with this flow rate are urgently needed. ant applications in medicine, agriculture, ecology, and Analysis is especially difficult for shotgun metagenomics industry [4]. data from environments such as soil where the diversity The first step in most sequence analyses is finding a suitable answer to the question, “How close is this se- * Correspondence: [email protected] quence to something seen before?”. The notion of close- 2Theoretical Division, MS K710, Los Alamos National Laboratory, Los Alamos, ness implied in the question is a phylogenetic distance, NM 87545, USA Full list of author information is available at the end of the article which is most properly answered by a phylogenetic © 2012 Berendzen et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Berendzen et al. BMC Research Notes 2012, 5:460 Page 2 of 21 http://www.biomedcentral.com/1756-0500/5/460 algorithm. Unfortunately the computational expense of homologous regions of genes for the purposes of con- such algorithms, coupled with the intractability of mak- structing a multiple sequence alignment [23,24]. In this ing the relevant alignments and trees for genes that may work, we are considering higher values of k, in order to have large numbers of paralogs, make this approach in- identify homologous genes by comparing entire bacterial feasible at present except for a small fraction of gene genomes. We begin by comparing the genomes of two families. The most common alternative is to find a proxy divergent bacteria and observing that random matches for phylogenetic distance in a more-readily-computed dominate for k < 8, while for k = 10, an average of only sequence similarity score as produced by the program about one random match is expected between the gen- BLAST [5] and its relatives. Yet the relationship between omes. Such 10-mer matches are evidently long enough sequence similarity and phylogenetic distance is skewed to specifically discriminate a portion of a conserved gene by rates of acceptance of mutations that can range over from other genes or organisms, while sufficiently short many orders of magnitude over a length scale of tens of as to be present in both reference genomes and soil bases due to differences in functional constraints experi- metagenomics data sets. enced by different parts of the gene [6]. Proteins from families of broadly-conserved genes and those parts of Results enzymes near an active site have significantly higher se- We begin by justifying our choice of k = 10 as the match quence identity than average [7]. The nature of current length long enough to be specific, yet short enough to shotgun metagenomics data, with short reads from be prevalent in environmental samples. Building on this randomly-selected regions of genes, tends to accentuate observation, we identify all 20 million strings of amino the problem of transforming similarity scores to some- acids of length 10 which are shared by at least two refer- thing resembling phylogenetic distances through ence genomes from distinct genera of bacteria. We de- injecting a noise term that can be difficult to remove by note these as orthogenomic signature peptides, and they post-processing (e.g., [8-10]). This problem exists even serve as the foundation for the rest of our analysis. We for close matches, but is exacerbated as similarity then develop one algorithm to establish the correct declines since the underlying sequence alignment may phylogenetic placement of these signature peptides, an- also be called into question. other to classify metagenomic reads matched by signa- A variety of methods to classify shotgun metagenomic ture peptides, and a final algorithm to functionally reads have been proposed, primarily based on protein profile reads through use of an externally defined data- families or gene clusters. These include partial assembly base. The use of fixed-length strings allows us to exploit and hidden-Markov-model searches [11] of protein fam- standard index-based information retrieval techniques ilies [12,13]; finding the closest neighbors in either nu- developed for web search engines. cleotide or protein space using a variety of similarity scores [8,14]; and finding shared sub-strings of variable Choice of k =10 length via suffix trees [15]. Other alternatives to similar- Figure 1 shows the run-length distribution of shared ity scores include short-seed [16] and sub-HMM [17] amino acid k-mers between Escherichia coli and various methods. Phylogenetic analysis is typically the next step sets of bacterial genomes, for k in the range of 3 to 500. after classification, using Least Common Ancestor [8], It is dominated by random matches for k < 8, and domi- nearest-neighbor [14,15], or hierarchical scoring [9] to nated by gene matches for k > 8. The solid red line is an assign phylogeny to the sequences identified in the clas- exponential fit to the run-length distribution of amino sification step. Because of the numerous pitfalls in acid matches between E. coli and Bacillus subtilis, designing a computer algorithm to define functionally reflecting a 16-fold reduction in the number of matches meaningful protein families [18], many classification for each increase by one in the run-length, k. The num- pipelines require continual curation of protein families, ber of random matches quickly drops with increasing k, which involves multiple-sequence alignment and the reaching 1.8 per pair of bacterial genomes for k=10, computation of a phylogenetic tree for each family, in with a ratio of observed matches of length 10 to the the hope of identifying orthologous genes [12,19]. Such expected number of random matches of 250. Since the efforts are labor-intensive and limited by the paucity of rest of our analysis will treat matches longer than 10 as biochemical validation of gene function. Another solu- multiple overlapping 10-mers, and the histogram in tion is to restrict analysis to a small number of well- Figure 1 counts matches only once, at the full extent of behaved ‘housekeeping’ genes [20-22]. However, using their match length, the appropriate ratio of non-random this approach results in discarding the vast majority of to random matches is not 250, but 1000. sequence reads. In addition to the small-k behavior of random Exact amino acid k-mer matches with k in the range matches, Figure 1 shows the large-k behavior for E. coli 3–6 have been employed to speed identification of scanned against three sets of bacteria. As the Berendzen et al. BMC Research Notes 2012, 5:460 Page 3 of 21 http://www.biomedcentral.com/1756-0500/5/460 A vertical slice through Figure 2, then, will approxi- mate the presence / absence phylogenetic profile of each gene across enteric bacteria (bottom portion) and repre- sentatives of the bacterial kingdom (top portion).