Accurate Phylogenetic Classification of DNA Fragments Based on Sequence Composition
Total Page:16
File Type:pdf, Size:1020Kb
Lawrence Berkeley National Laboratory Lawrence Berkeley National Laboratory Title Accurate phylogenetic classification of DNA fragments based on sequence composition Permalink https://escholarship.org/uc/item/5kc0f71j Authors McHardy, Alice C. Garcia Martin, Hector Tsirigos, Aristotelis et al. Publication Date 2006-05-01 Peer reviewed eScholarship.org Powered by the California Digital Library University of California Accurate phylogenetic classification of DNA fragments based on sequence composition Alice C. McHardy 1, Héctor García Martín 2, Aristotelis Tsirigos 1, Philip Hugenholtz 2, † and Isidore Rigoutsos 1 1 Bioinformatics and Pattern Discovery Group, IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA 2 US Department of Energy Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, CA, USA. † To whom correspondence should be addressed. 1 Abstract Metagenome studies have retrieved vast amounts of sequence out of a variety of environments, leading to novel discoveries and great insights into the uncultured microbial world. Except for very simple communities, diversity makes sequence 5 assembly and analysis a very challenging problem. To understand the structure and function of microbial communities, a taxonomic characterization of the obtained sequence fragments is highly desirable, yet currently limited mostly to those sequences that contain phylogenetic marker genes. We show that for clades at the rank of domain down to genus, sequence composition allows the very accurate phylogenetic 10 characterization of genomic sequence. We developed a composition-based classifier, PhyloPythia, for de novo phylogenetic sequence characterization and have trained it on a data set of 340 genomes. By extensive evaluation experiments we show that the method is accurate across all taxonomic ranks considered, even for sequences that originate from novel organisms and are as short as 1kb. Application to two metagenome datasets 15 obtained from samples of phosphorus-removing sludge showed that the method allows the accurate classification at genus level of most sequence fragments from the dominant populations, while at the same time correctly characterizing even larger parts of the samples at higher taxonomic levels. 20 25 30 2 Introduction The emerging field of metagenomics is dedicated to the study of sequences obtained directly by high-throughput sequencing of DNA samples from microbial communities. 5 The approach has already delivered exciting insights into the lifestyle, evolution and characteristics of microbial organisms 3-5 that could not have been obtained otherwise, as the vast majority of microbes resist cultivation 6. From a technical standpoint, the field has created novel computational challenges, such as a need for assembly and gene finding programs tailored for highly diverse sequence collections of organisms sampled with 10 different abundances, and tools for the accurate phylogenetic characterization of the short sequences that are created in vast amounts. One approach to classifying metagenomic sequence fragments is the use of ‘marker genes’, such as ribosomal RNAs, as phylogenetic anchors for identification of the source organism of a fragment. Ribosomal RNAs are highly conserved and allow the 15 most accurate placement of an organism or sequence harboring the respective gene within the tree of life. The original observation and the framework for the quantification of evolutionary relationships resulted from the pioneering work of Carl Woese and colleagues 7, 8 . Even though the original marker gene collection has been expanding through inclusion of ubiquitous and slowly evolving or clade-specific proteins 9-11 , the 20 approach permits the characterization of only a limited number of fragments. For a very low complexity community found in acid mine drainage 5, an organism-specific ‘binning’ of fragments based on GC-content and read coverage retrieved near-complete genomes for the dominant species, which displayed significant differences in genomic GC-content. The use of tetranucleotide signatures has also shown promise in the characterization of 25 low-complexity communities or the dominant organisms of more complex populations 12 . However, these schemes are not able to characterize the more diverse (and thus more challenging) metagenomes such as those from the very complex communities found in soil, which are estimated to contain millions of distinct taxa 13 . In the case of an extremely complex Minnesota soil community, a gene-centric characterization of the sample was 30 undertaken, because less than 1% of reads could be assembled 4. Gene-centric analyses, while useful for determining genes important for overall community function, in most 3 cases do not allow identification of the species harboring these genes. It is thus imperative that fast and accurate tools be developed that will allow the taxonomic characterization of short genomic sequence fragments and enable more comprehensive metagenome analyses. 5 In what follows, we present a method which uses clade-specific characteristics in sequence composition to phylogenetically characterize sequence fragments. Genomic sequence composition is well known to reflect organism-specific characteristics, which has been dubbed the ‘genome signature’ 14-19 . The phenomenon is sufficiently pronounced to allow the simultaneous supervised or unsupervised discrimination between several 10 different species for even relatively short sequence fragments12, 20-22 . Furthermore, these signatures also carry phylogenetic information, as recent studies on smaller data sets have shown 23, 24 . Based on draft or high-quality genomic sequences of 340 organisms from all domains of life, we constructed, optimized and extensively evaluated a composition- based phylogenetic classifier. For classification we use a multi-class Support Vector 15 Machine classifier with the oligonucleotide composition of genome fragments as the input space. Due to the multitude of influences on sequence composition, the SVM is a good technique for this problem, as it is able to learn the relevant clade-specific characteristics of sequence composition also in spaces that are dominated by other influences. Across the complete sequence space considered, our method allows the 20 accurate phylogenetic classification of genomic fragments. This is true for all taxonomic ranks considered (domain, phylum, class, order and genus), and, more importantly, for previously unseen fragments which originate from novel organisms. We named our new method for phylogenetic sequence characterization PhyloPythia 1. We applied the method for the phylogenetic characterization of two metagenome 25 samples of biological phosphorus removing sludge as used in the industrial processing of wastewater 25 . Our technique was able to automatically assign the phylogenetically characterized fragments of these samples (based on marker genes, population overlap between communities and scaffolding of contigs by read pair information) to the correct clades with high accuracy. Furthermore, additional genomic fragments could be assigned 30 that could not be characterized by any other method, and larger parts of the sample could 1 After the prophetess of the oracle of Delphi, who was commonly known as the Pythia. 4 be characterized at higher taxonomic levels in agreement with marker-gene based studies of sample composition. Materials and Methods 5 Compositional sequence patterns For compositional feature analysis, a given piece of DNA sequence s is mapped to a higher-dimensional space of nucleotide patterns = { 1, 2,..., q}, where is defined by the pattern length w and the number of literals l 19, 26 (Supplementary Figure S10). In this space, s is represented by the compositional input vector #(s) = ( 1, 2,..., q); where i is 10 the frequency of pattern i in s. The method we propose is a generalization of conventional compositional approaches and exhibits several desirable properties. First, nucleotide patterns of arbitrary lengths and densities can be computed, which in turn allows to select the parameters with the most discriminatory power. Second, the method extends straightforward composition-based schemes in that is able to ‘ignore’ certain 15 nucleotide positions: this is achieved through the use of generating templates that include ‘gaps’ and thus do not comprise continuous nucleotides. One such example template is A.G, which will match any of AAG , ACG , AGG or ATG , while ignoring the identity of the nucleotide that occupies the middle position. Third, optionally the periodicity of the genetic code is taken into account: in particular, when collecting the instances of a 20 pattern, the constraint can be imposed that a pattern be position-specific. The input vectors are subsequently normalized by the total number of patterns for each sequence. Multi-class classification Phylogenetic classification is a multi-class problem, where at any given rank, such as e.g. 25 the domain, an organism belongs to exactly one out of all existing clades. In the multi- class model, each adequately sampled clade for a particular rank is represented by an individual class. Adequately sampled here means a representation by at least 3 or more different species in the genomic data set. This allows us to estimate the classification accuracy for sequences from novel, unknown organisms of this clade by setting aside the 30 sequence from at least one organism for testing, and using those of the other two for learning the clade-specific properties in