Sequence Classification Using Reference Taxonomies
Gabriel Valiente

Algorithms, Bioinformatics, Complexity and Formal Methods Research Group, Technical University of Catalonia
Computational Biology and Bioinformatics Research Group, Research Institute of Health Science, University of the Balearic Islands
Centre for Genomic Regulation, Barcelona Biomedical Research Park

Phylogenetics: New Data, New Phylogenetic Challenges
Isaac Newton Institute for Mathematical Sciences, Cambridge, UK, 20–24 June 2011

Gabriel Valiente (UPC) Taxonomic Classification PLG 2011 1 / 69

Abstract

Next-generation sequencing technologies have opened up an unprecedented opportunity for microbiology by enabling the culture-independent genetic study of complex microbial communities, which until now were largely unknown. The analysis of metagenomic data is challenging, since a sample may contain a mixture of many different microbial species whose genomes have not necessarily been sequenced beforehand. In this talk, we address the problem of analyzing metagenomic data for which databases of reference sequences are already available. We discuss both composition-based and alignment-based methods for the classification of sequence reads, and present recent results on the assignment of ambiguous sequence reads to microbial species at the best possible taxonomic rank.

Where we left off back in 2007

1. G. Cardona, F. Rosselló, G. Valiente. Tripartitions do not always discriminate Phylogenetic Networks. Mathematical Biosciences (2008)
2. G. Cardona, F. Rosselló, G. Valiente. A Perl Package and an Alignment Tool for Phylogenetic Networks. BMC Bioinformatics (2008)
3. G. Cardona, M. Llabrés, F. Rosselló, G. Valiente. A Distance Metric for a Class of Tree-Sibling Phylogenetic Networks. Bioinformatics (2008)
4. M. Arenas, G. Valiente, D. Posada.
Characterization of Reticulate Networks based on the Coalescent with Recombination. Molecular Biology and Evolution (2008)
5. G. Cardona, F. Rosselló, G. Valiente. Comparison of Tree-Child Phylogenetic Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics (2009)
6. G. Cardona, M. Llabrés, F. Rosselló, G. Valiente. Metrics for Phylogenetic Networks I: Generalizations of the Robinson-Foulds Metric. IEEE/ACM Transactions on Computational Biology and Bioinformatics (2009)
7. G. Cardona, M. Llabrés, F. Rosselló, G. Valiente. Metrics for Phylogenetic Networks II: Nodal and Triplets Metrics. IEEE/ACM Transactions on Computational Biology and Bioinformatics (2009)
8. G. Cardona, F. Rosselló, G. Valiente. Extended Newick: It is Time for a Standard Representation of Phylogenetic Networks. BMC Bioinformatics (2008)
9. G. Cardona, M. Llabrés, F. Rosselló, G. Valiente. On Nakhleh’s Metric for Reduced Phylogenetic Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics (2009)
10. F. Rosselló, G. Valiente. All that Glisters is not Galled. Mathematical Biosciences (2009)
11. G. Cardona, M. Llabrés, F. Rosselló, G. Valiente. Path Lengths in Tree-Child Time Consistent Hybridization Networks. Information Sciences (2010)
12. M. Arenas, M. Patricio, D. Posada, G. Valiente. Characterization of Phylogenetic Networks with NetTest. BMC Bioinformatics (2010)
13. G. Cardona, M. Llabrés, F. Rosselló, G. Valiente. Comparison of Galled Trees. IEEE/ACM Transactions on Computational Biology and Bioinformatics (2011)
14. T. Asano, J. Jansson, K. Sadakane, R. Uehara, G. Valiente. Faster Computation of the Robinson-Foulds Distance between Phylogenetic Networks.
(2011)

Plan of the talk

1. Taxonomic Classification
2. Classification in Genomics
3. Classification in Metagenomics
4. Composition-Based Classification Methods
5. Alignment-Based Classification Methods
6. Classification of Ambiguous Sequences
7. Taxonomic Diversity
8. Taxonomic Classification in Practice

Taxonomic Classification

C. Linnæus (1735)

K   King      Kingdom   Plantae
P   Phillip   Phylum    Streptophyta
C   Came      Class     Streptophytina
O   Over      Order     Solanales
F   For       Family    Solanaceae
G   Green     Genus     Solanum
S   Soup      Species   Solanum lycopersicum

Solanum caule inermi herbaceo, foliis pinnatis incisis, racemis simplicibus

C. Darwin (1837–1843)
E. Haeckel (1866)
H. F. Copeland (1938)
R. H.
Whittaker (1969)

Taxonomic Classification

Linnæus (1735):    Plantae, Animalia
Haeckel (1866):    Protista, Plantae, Animalia
Copeland (1938):   Monera, Protoctista, Plantae, Animalia
Whittaker (1969):  Monera, Protista, Plantae, Fungi, Animalia
Woese (1987):      Bacteria, Archaea, Eukarya

Taxonomic Classification

• Linnæan taxonomy methods: classifying species of organisms into ranks (kingdom, class, order, family, genus, species)
• Phenetic (numerical taxonomy) methods: classifying species of organisms on the basis of overall morphological similarity
• Cladistic methods: classifying species of organisms into monophyletic groups, called clades, on the basis of shared derived characters
• Evolutionary taxonomy methods: classifying species of organisms on the basis of phylogenetic relationship and overall morphological similarity

D. Gusfield (1997)

Taxonomic Classification: High-Throughput Sequencing Technologies

J. E. Cohen (2004)

Instrument               Throughput   Read length   Turnaround time   Raw read accuracy
ABI 3730 DNA Analyzer    32M          700           1h                98.5% Q20
Roche/454 GS FLX         400M         400           10h               99% Q20
Illumina/Solexa GA IIe   4G           35            2d                90% Q30
ABI/SOLiD 3 Plus         25G          50            7d                80% Q30
Helicos HeliScope        35G          35            8d                ···
PB PacBio RS             ···T         964           ···               ···

There is a clear trend toward higher throughput, at the expense of slower turnaround and lower raw read accuracy.

Classification in Genomics

C. R. Woese (1987)
N.
Pace (1997)
D. H. Bergey (1923–1994; 1984–2013)

Classification in Genomics: Resources

• NCBI Taxonomy http://www.ncbi.nlm.nih.gov/Taxonomy/
• ARB-SILVA http://www.arb-silva.de/
• Greengenes http://greengenes.lbl.gov/
• Ribosomal Database Project http://rdp.cme.msu.edu/
• Taxonomic Outline of Bacteria and Archaea http://www.taxonomicoutline.org/
• Integrated Taxonomic Information System http://www.itis.gov/
• Encyclopedia of Life http://www.eol.org/
• Tree of Life http://www.tolweb.org/tree/

Classification in Metagenomics

D. H. Huson, A. F. Auch, J. Qi, S. C. Schuster (2007)

Composition-Based Classification Methods

Classification of whole genomes
• k-mer searching
• Top hit: closest sequence to the sequence read
• Best stratum: sequences at the same distance as the top hit

Classification of 16S ribosomal RNA sequences
• k-mer searching
• Top hit: closest sequence to the sequence read
• Best stratum: sequences at the same distance as the top hit

Composition-Based Classification Methods

Classification of whole genomes
• k-mer