Sequence Classification using Reference Taxonomies

Gabriel Valiente

Algorithms, Bioinformatics, Complexity and Formal Methods Research Group Technical University of Catalonia

Computational Biology and Bioinformatics Research Group Research Institute of Health Science, University of the Balearic Islands

Centre for Genomic Regulation Barcelona Biomedical Research Park

Phylogenetics: New Data, New Phylogenetic Challenges Isaac Newton Institute for Mathematical Sciences Cambridge, UK, 20–24 June 2011

Gabriel Valiente (UPC) Taxonomic Classification PLG 2011 1 / 69

Abstract

Next-generation sequencing technologies have opened up an unprecedented opportunity for microbiology by enabling the culture-independent genetic study of complex microbial communities, which were so far largely unknown. The analysis of metagenomic data is challenging, since a sample may contain a mixture of many different microbial species whose genomes have not necessarily been sequenced beforehand. In this talk, we address the problem of analyzing metagenomic data for which databases of reference sequences are already known. We discuss both composition-based and alignment-based methods for the classification of sequence reads, and present recent results on the assignment of ambiguous sequence reads to microbial species at the best possible taxonomic rank.

Where we left off in 2007...

1 G. Cardona, F. Rosselló, G. Valiente. Tripartitions do not always discriminate Phylogenetic Networks. Mathematical Biosciences (2008)

2 G. Cardona, F. Rosselló, G. Valiente. A Perl Package and an Alignment Tool for Phylogenetic Networks. BMC Bioinformatics (2008)

3 G. Cardona, M. Llabrés, F. Rosselló, G. Valiente. A Distance Metric for a Class of Tree-Sibling Phylogenetic Networks. Bioinformatics (2008)

4 M. Arenas, G. Valiente, D. Posada. Characterization of Reticulate Networks based on the Coalescent with Recombination. Molecular Biology and Evolution (2008)

5 G. Cardona, F. Rosselló, G. Valiente. Comparison of Tree-Child Phylogenetic Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics (2009)

6 G. Cardona, M. Llabrés, F. Rosselló, G. Valiente. Metrics for Phylogenetic Networks I: Generalizations of the Robinson-Foulds Metric. IEEE/ACM Transactions on Computational Biology and Bioinformatics (2009)

7 G. Cardona, M. Llabrés, F. Rosselló, G. Valiente. Metrics for Phylogenetic Networks II: Nodal and Triplets Metrics. IEEE/ACM Transactions on Computational Biology and Bioinformatics (2009)

8 G. Cardona, F. Rosselló, G. Valiente. Extended Newick: It is Time for a Standard Representation of Phylogenetic Networks. BMC Bioinformatics (2008)

9 G. Cardona, M. Llabrés, F. Rosselló, G. Valiente. On Nakhleh’s Metric for Reduced Phylogenetic Networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics (2009)

10 F. Rosselló, G. Valiente. All that Glisters is not Galled. Mathematical Biosciences (2009)

11 G. Cardona, M. Llabrés, F. Rosselló, G. Valiente. Path Lengths in Tree-Child Time Consistent Hybridization Networks. Information Sciences (2010)

12 M. Arenas, M. Patricio, D. Posada, G. Valiente. Characterization of Phylogenetic Networks with NetTest. BMC Bioinformatics (2010)

13 G. Cardona, M. Llabrés, F. Rosselló, G. Valiente. Comparison of Galled Trees. IEEE/ACM Transactions on Computational Biology and Bioinformatics (2011)

14 T. Asano, J. Jansson, K. Sadakane, R. Uehara, G. Valiente. Faster Computation of the Robinson-Foulds Distance between Phylogenetic Networks. (2011)

Plan of the talk

1 Taxonomic Classification

2 Classification in Genomics

3 Classification in Metagenomics

4 Composition-Based Classification Methods

5 Alignment-Based Classification Methods

6 Classification of Ambiguous Sequences

7 Taxonomic Diversity

8 Taxonomic Classification in Practice

Taxonomic Classification

C. Linnæus (1735)

Taxonomic Classification

Solanum caule inermi herbaceo, foliis pinnatis incisis, racemis simplicibus

King     Kingdom  Plantae
Phillip  Phylum   Streptophyta
Came     Class    Streptophytina
Over     Order    Solanales
For      Family   Solanaceae
Green    Genus    Solanum
Soup     Species  Solanum lycopersicum

Taxonomic Classification

C. Darwin (1837–1843)

Taxonomic Classification

E. Haeckel (1866)

Taxonomic Classification

H. F. Copeland (1938)

Taxonomic Classification

R. H. Whittaker (1969)

Taxonomic Classification

Linnæus   Haeckel   Copeland     Whittaker  Woese
1735      1866      1938         1969       1987
                    Monera       Monera
          Protista  Protoctista  Protista
Plantae   Plantae   Plantae      Plantae    Eukarya
                    Protoctista  Fungi
Animalia  Animalia  Animalia     Animalia

Taxonomic Classification

Linnæan methods
  Classifying species of organisms into ranks (kingdom, class, order, family, genus, species)
Phenetic (numerical taxonomy) methods
  Classifying species of organisms on the basis of overall morphological similarity
Cladistic methods
  Classifying species of organisms into monophyletic groups called clades on the basis of shared derived characters
Evolutionary taxonomy methods
  Classifying species of organisms on the basis of phylogenetic relationship and overall morphological similarity

D. Gusfield (1997)

Taxonomic Classification High-Throughput Sequencing Technologies

J. E. Cohen (2004)

Taxonomic Classification High-Throughput Sequencing Technologies

Instrument              Throughput  Read length  Turnaround time  Raw read accuracy
ABI 3730 DNA Analyzer   32M         700          1h               98.5% Q20
Roche/454 GS FLX        400M        400          10h              99% Q20
Illumina/Solexa GA IIe  4G          35           2d               90% Q30
ABI/SOLiD 3 Plus        25G         50           7d               80% Q30
Helicos HeliScope       35G         35           8d               ···
PB PacBio RS            ···T        964          ···              ···

There is a clear trend toward higher throughput, at the expense of slower turnaround and lower raw read accuracy

Classification in Genomics

C. R. Woese (1987)

Classification in Genomics

N. Pace (1997)

Classification in Genomics

D. H. Bergey (1923–1994; 1984–2013)

Classification in Genomics Resources

• NCBI Taxonomy http://www.ncbi.nlm.nih.gov/Taxonomy/
• ARB-SILVA http://www.arb-silva.de/
• Greengenes http://greengenes.lbl.gov/
• Ribosomal Database Project http://rdp.cme.msu.edu/
• Taxonomic Outline of Bacteria and Archaea http://www.taxonomicoutline.org/

• Integrated Taxonomic Information System http://www.itis.gov/
• Encyclopedia of Life http://www.eol.org/
• Tree of Life http://www.tolweb.org/tree/

Classification in Metagenomics

D. H. Huson, A. F. Auch, J. Qi, S. C. Schuster (2007)

Composition-Based Classification Methods

Classification of whole genomes
• k-mer searching
• Top hit: closest sequence to the sequence read
• Best stratum: sequences at the same distance as the top hit

Classification of 16S ribosomal RNA sequences
• k-mer searching
• Top hit: closest sequence to the sequence read
• Best stratum: sequences at the same distance as the top hit
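The notions of top hit and best stratum in a k-mer search can be illustrated with a minimal Python sketch. The distance used here (the fraction of the read's k-mers missing from a reference) is an assumption chosen for simplicity, not the scoring of any particular tool listed in this talk, and the reference sequences are hypothetical toy strings.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(read, ref, k):
    """Fraction of the read's k-mers that are absent from the reference."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        raise ValueError("read is shorter than k")
    shared = read_kmers & kmers(ref, k)
    return 1.0 - len(shared) / len(read_kmers)


def best_stratum(read, references, k):
    """Return (top-hit distance, names of all references at that distance)."""
    distances = {name: kmer_distance(read, seq, k)
                 for name, seq in references.items()}
    top = min(distances.values())
    return top, sorted(name for name, d in distances.items() if d == top)


# Hypothetical toy reference sequences (not real 16S data).
references = {
    "taxon_a": "ACGTACGTGGA",
    "taxon_b": "TTACGTACGTC",
    "taxon_c": "GGGGCCCCAAA",
}
top, stratum = best_stratum("ACGTACG", references, k=4)
```

Here two references tie for the top hit, so the best stratum holds both of them and the read is ambiguous.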

Composition-Based Classification Methods Resources

• DISCRIBINATE
• MOTHUR http://www.mothur.org/
• Phymm http://www.cbcb.umd.edu/software/phymm/
• TACOA http://www.cebitec.uni-bielefeld.de/brf/tacoa/tacoa.html
• TETRA http://www.megx.net/tetra/

• NBC http://nbc.ece.drexel.edu/
• PhyloPythia http://cbcsrv.watson.ibm.com/phylopythia.html
• RDP Classifier http://rdp.cme.msu.edu/classifier/
• SPHINX http://metagenomics.atc.tcs.com/SPHINX/

Alignment-Based Classification Methods

Classification of whole genomes
• BLAST alignment of sequence reads
• Top hit: E-value up to 0.001
• Best stratum: sequences with the same E-value as the top hit

Classification of 16S ribosomal RNA sequences
• BWT alignment of sequence reads
• Top hit: sequences with at most k mismatches
• Best stratum: sequences with the same number of mismatches as the top hit
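The mismatch-based definitions of hits and best stratum can be sketched as follows. This is a brute-force scan over all alignment positions, used only to make the definitions concrete; it stands in for a BWT-indexed aligner and is not how such aligners work internally. The reference names and sequences are hypothetical.

```python
def min_mismatches(read, ref):
    """Smallest Hamming distance between read and any equal-length substring of ref."""
    n = len(read)
    if n > len(ref):
        return None
    return min(
        sum(a != b for a, b in zip(read, ref[i:i + n]))
        for i in range(len(ref) - n + 1)
    )


def hits_and_best_stratum(read, references, k):
    """Return (hits with at most k mismatches, best stratum among those hits)."""
    hits = {}
    for name, seq in references.items():
        d = min_mismatches(read, seq)
        if d is not None and d <= k:
            hits[name] = d
    if not hits:
        return {}, []
    best = min(hits.values())
    return hits, sorted(name for name, d in hits.items() if d == best)


# Hypothetical toy references.
references = {
    "s1": "AACCGGTTAC",
    "s2": "AACCGCTTAC",
    "s3": "TTTTTTTTTT",
}
hits, stratum = hits_and_best_stratum("CCGGTT", references, k=2)
```

With k = 2 the read hits s1 exactly and s2 with one mismatch, while s3 is rejected; the best stratum keeps only s1.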

Alignment-Based Classification Methods Resources

• CARMA http://webcarma.cebitec.uni-bielefeld.de/
• MEGAN http://www-ab.informatik.uni-tuebingen.de/software/megan/

• CAMERA http://camera.calit2.net/
• Galaxy http://galaxy.psu.edu/
• MG-RAST http://metagenomics.anl.gov/

Classification of Ambiguous Sequences

Taxonomic assignment problem

Input
• A genomic reference S (set of sequences)
• A taxonomic reference T (tree) with a leaf set L, where each leaf in L has an associated known sequence of S
• A set R of (short or long) sequence reads
• A positive integer k

Output
For each read Ri ∈ R, a single node in T that represents in a “good” way the subset Mi ⊆ L of hits or matches whose sequences contain a substring with at most k mismatches to Ri
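The simplest choice of a single node representing the hits Mi is their lowest common ancestor (LCA) in T. A minimal sketch, assuming the taxonomy is stored as a child-to-parent mapping; the node names are hypothetical:

```python
def path_to_root(node, parent):
    """Return the list of nodes from node up to the root (inclusive)."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


def lowest_common_ancestor(hits, parent):
    """Return the deepest node that is an ancestor of (or equal to) every hit."""
    hits = list(hits)
    if not hits:
        raise ValueError("no hits: the read cannot be assigned")
    # Root-to-node paths share a prefix; the last shared node is the LCA.
    paths = [path_to_root(h, parent)[::-1] for h in hits]
    lca = None
    for level in zip(*paths):
        if all(node == level[0] for node in level):
            lca = level[0]
        else:
            break
    return lca


# Hypothetical taxonomy: child -> parent (None marks the root).
parent = {
    "root": None,
    "familyA": "root", "familyB": "root",
    "genus1": "familyA", "genus2": "familyA", "genus3": "familyB",
    "s1": "genus1", "s2": "genus1", "s3": "genus2", "s4": "genus3",
}
```

For example, `lowest_common_ancestor({"s1", "s3"}, parent)` yields `"familyA"`: a single distant hit pulls the assignment up the tree, which is what motivates the finer-grained scoring that follows.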

Classification of Ambiguous Sequences Sensitivity and Specificity

Accuracy and coverage of taxonomic assignment

Precision
  Proportion of correctly labeled positive elements with respect to the total number of elements labeled positive (correctly or incorrectly labeled positive)

\[ P = \frac{TP}{TP + FP} \]

Recall
  Proportion of correctly labeled positive elements with respect to the total number of positive elements (correctly labeled positive or incorrectly labeled negative)

\[ R = \frac{TP}{TP + FN} \]

F-measure
  Harmonic mean of precision and recall

\[ F = \frac{2}{\frac{1}{P} + \frac{1}{R}} = \frac{2PR}{P + R} \]

Classification of Ambiguous Sequences Sensitivity and Specificity

Accuracy and coverage of taxonomic assignment Given a reference taxonomy T , a set R of sequence reads, and a threshold value k of sequence similarity,

• Let Ri be the ith read

• Let Mi be the leaves of T matching Ri with up to k mismatches

• Let Ti be the subtree of T rooted at the lowest common ancestor of Mi

• Let Ni be the leaves of Ti not matching Ri with up to k mismatches

For the ith read, the leaves of Ti can be partitioned in the following four subsets:

• TPi = Mi (true positives)

• FPi = Ni (false positives)

• TNi = ∅ (true negatives)

• FNi = ∅ (false negatives)

Classification of Ambiguous Sequences Sensitivity and Specificity

Accuracy and coverage of taxonomic assignment

[Figure: the subtree Ti, with its leaves split into Ni (false positives FPi) and Mi (true positives TPi)]

Classification of Ambiguous Sequences Sensitivity and Specificity

Accuracy and coverage of taxonomic assignment Given a reference taxonomy T , a set R of sequence reads, and a threshold value k of sequence similarity,

• Let Tij be the subtree of T rooted at the jth node of Ti

• Let Mij be the leaves of Tij matching Ri with up to k mismatches

• Let Nij be the leaves of Tij not matching Ri with up to k mismatches

For the ith read and the jth node of Ti , the leaves of Ti can be partitioned in the following four subsets:

• TPij = Mij (true positives)

• FPij = Nij (false positives)

• TNij = Ni \ Nij (true negatives)

• FNij = Mi \ Mij (false negatives)

Classification of Ambiguous Sequences Sensitivity and Specificity

Accuracy and coverage of taxonomic assignment

[Figure: the subtree Ti with the nested subtree Tij; the leaves of Ni outside Tij are TNij, the leaves Nij are FPij, the leaves Mij are TPij, and the leaves of Mi outside Tij are FNij]

Classification of Ambiguous Sequences Sensitivity and Specificity

Accuracy and coverage of taxonomic assignment

Given hits Mi = {s1,s5,s6,s7}, assigning read Ri to the jth (blue circled) node of Ti partitions the leaves of Ti into true positives TPi,j = {s5,s6,s7}, false positives FPi,j = {s8}, true negatives TNi,j = {s2,s3,s4}, and false negatives FNi,j = {s1}
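The partition in this example can be reproduced with plain set operations. In the sketch below, the leaf set of the jth node is taken to be TPi,j ∪ FPi,j = {s5, s6, s7, s8} and the leaf set of Ti to be s1 through s8, both read off the four sets stated above; the function names are illustrative only.

```python
def classify_leaves(subtree_leaves, node_leaves, hits):
    """Partition the leaves of T_i for a candidate node j.

    subtree_leaves: leaves of T_i (subtree rooted at the LCA of the hits)
    node_leaves:    leaves of T_ij (subtree rooted at candidate node j)
    hits:           M_i, the leaves matching the read
    """
    tp = node_leaves & hits                      # M_ij
    fp = node_leaves - hits                      # N_ij
    fn = hits - node_leaves                      # M_i \ M_ij
    tn = (subtree_leaves - hits) - node_leaves   # N_i \ N_ij
    return tp, fp, tn, fn


def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure from the partition (needs a non-empty tp)."""
    p = len(tp) / (len(tp) + len(fp))
    r = len(tp) / (len(tp) + len(fn))
    return p, r, 2 * p * r / (p + r)


leaves_ti = {f"s{n}" for n in range(1, 9)}   # s1..s8
hits = {"s1", "s5", "s6", "s7"}              # M_i
leaves_tij = {"s5", "s6", "s7", "s8"}        # leaves below the jth node
tp, fp, tn, fn = classify_leaves(leaves_ti, leaves_tij, hits)
p, r, f = precision_recall_f(tp, fp, fn)
```

For this node, precision and recall are both 3/4, so the F-measure is 3/4 as well.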

Classification of Ambiguous Sequences Sensitivity and Specificity

Accuracy and coverage of taxonomic assignment

Precision
  The precision of classifying Ri as Tj is

\[ P_{ij} = \frac{|TP_{ij}|}{|TP_{ij}| + |FP_{ij}|} \]

Recall
  The recall of classifying Ri as Tj is

\[ R_{ij} = \frac{|TP_{ij}|}{|TP_{ij}| + |FN_{ij}|} \]

F-measure
  The combined F-measure of precision and recall is

\[ F_{ij} = \frac{2}{\frac{1}{P_{ij}} + \frac{1}{R_{ij}}} = \frac{2 P_{ij} R_{ij}}{P_{ij} + R_{ij}} \]

Classification of Ambiguous Sequences Sensitivity and Specificity

Example (Accuracy and coverage of taxonomic assignment)

Bacteria > Aquificae > Aquificae > Aquificales > Aquificaceae
  Aquifex
  Hydrogenobaculum
    Hydrogenobaculum acidophilum
  Hydrogenobacter
    Hydrogenobacter subterraneus
    Hydrogenobacter thermophilus
    Hydrogenobacter hydrogenophilus
  Persephonella
    Persephonella hydrogeniphila
    Persephonella marina
    Persephonella guaymasensis
  Sulfurihydrogenibium
    Sulfurihydrogenibium subterraneum
    Sulfurihydrogenibium azorense
    Sulfurihydrogenibium yellowstonense
  Thermocrinis
    Thermocrinis albus
    Thermocrinis ruber
  Hydrogenivirga
    Hydrogenivirga caldilitoris

Two candidate assignments are annotated in the figure:
• P = 6/(6 + 8) = 43%, R = 6/(6 + 0) = 100%, F = 60%
• P = 3/(3 + 0) = 100%, R = 3/(3 + 3) = 50%, F = 67%

Classification of Ambiguous Sequences Sensitivity and Specificity

Beyond the F-measure

F-measure
  The combined F-measure of precision and recall is

\[ F_{ij} = \frac{2 P_{ij} R_{ij}}{P_{ij} + R_{ij}} = \frac{2|TP_{ij}|}{|FN_{ij}| + |FP_{ij}| + 2|TP_{ij}|} \]

Penalty score
  The penalty score of assigning Ri to Tj is

\[ PS_{ij} = q \, \frac{|FN_{ij}|}{|TP_{ij}|} + (1 - q) \, \frac{|FP_{ij}|}{|TP_{ij}|} \]

The parameter q takes values in the range [0,1] and determines how close to the LCA or to the leaves the assignment shall be
• q = 0: each read Ri is assigned to a matching leaf
• q = 1/2: each read Ri is assigned to the node that maximizes the F-measure
• q = 1: each read Ri is assigned to the LCA of the matching leaves Mi
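The effect of q can be checked numerically on the two annotated nodes of the Aquificae example, whose counts are (TP, FP, FN) = (6, 8, 0) and (3, 0, 3). The sketch below is a direct transcription of the penalty score formula; the labels "higher node" and "lower node" are illustrative names, not taken from the slides.

```python
def penalty_score(tp, fp, fn, q):
    """PS = q * FN/TP + (1 - q) * FP/TP, for counts with tp > 0 and 0 <= q <= 1."""
    if tp <= 0:
        raise ValueError("a candidate node needs at least one true positive")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    return q * fn / tp + (1.0 - q) * fp / tp


def f_measure(tp, fp, fn):
    """F = 2 TP / (FN + FP + 2 TP)."""
    return 2 * tp / (fn + fp + 2 * tp)


# (TP, FP, FN) counts of the two candidate nodes in the Aquificae example.
candidates = {"higher node": (6, 8, 0), "lower node": (3, 0, 3)}

best_by_q = {
    q: min(candidates, key=lambda name: penalty_score(*candidates[name], q))
    for q in (0.0, 0.5, 1.0)
}
```

At q = 0 and q = 1/2 the lower node has the smaller penalty (it also has the larger F-measure, 67% against 60%), while at q = 1 the higher node wins because it has no false negatives.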

Classification of Ambiguous Sequences Sensitivity and Specificity

Beyond the F-measure

Generalization of the F-measure
  For q = 0.5, the node m that minimizes the penalty score is

\begin{align*}
& \operatorname*{argmin}_m \left( \frac{|FN_{im}|}{|TP_{im}|} + \frac{|FP_{im}|}{|TP_{im}|} \right) \\
&= \operatorname*{argmin}_m \frac{|FN_{im}| + |FP_{im}|}{|TP_{im}|} \\
&= \operatorname*{argmin}_m \frac{|FN_{im}| + |FP_{im}|}{2|TP_{im}|} \\
&= \operatorname*{argmin}_m \left( \frac{|FN_{im}| + |FP_{im}|}{2|TP_{im}|} + 1 \right) \\
&= \operatorname*{argmin}_m \frac{|FN_{im}| + |FP_{im}| + 2|TP_{im}|}{2|TP_{im}|} \\
&= \operatorname*{argmax}_m \frac{2|TP_{im}|}{|FN_{im}| + |FP_{im}| + 2|TP_{im}|}
\end{align*}

The node that minimizes the penalty score is the same node that would maximize the F-measure

Classification of Ambiguous Sequences Efficient Algorithms

Theorem

Given a set Mi ⊆ L of hits and the subtree Ti of T rooted at the LCA of Mi , the penalty scores PSi,j for every node j in Ti can be obtained in O(|Ti |) total time

Theorem

Given a set Mi ⊆ L of hits and the subtree Ti of T rooted at the LCA of Mi , the penalty scores PSi,j for every node j in Ti can be obtained in O(|Mi |) total time after O(|T |) time preprocessing

Classification of Ambiguous Sequences Efficient Algorithms

Theorem

Given a set Mi ⊆ L of hits and the subtree Ti of T rooted at the LCA of Mi , the penalty scores PSi,j for every node j in Ti can be obtained in O(|Ti |) total time

Proof sketch.

The penalty score formula can be rewritten in terms of |Mi |, |Mi,j | and |Ni,j | as

\[ PS_{i,j} = \frac{q \cdot |FN_{i,j}| + (1 - q) \cdot |FP_{i,j}|}{|TP_{i,j}|} = \frac{q \cdot (|M_i| - |M_{i,j}|) + (1 - q) \cdot |N_{i,j}|}{|M_{i,j}|} \]

For any node j in Ti, the values of |Mi,j| and |Ni,j| may be expressed as

• If j is a leaf in Ti and j ∈ Mi, then |Mi,j| = 1, |Ni,j| = 0, and |Li,j| = 1

• If j is a leaf in Ti and j ∉ Mi, then |Mi,j| = 0, |Ni,j| = 1, and |Li,j| = 1

• If j is an internal node in Ti, then |Mi,j| = ∑j′ |Mi,j′| and |Li,j| = ∑j′ |Li,j′|, where j′ ranges over the children of j in Ti, and |Ni,j| = |Li,j| − |Mi,j|
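The recurrences above translate into a single post-order traversal of Ti, which is where the O(|Ti|) bound comes from. The following is a minimal sketch under stated assumptions, not the TANGO implementation: the tree is a hypothetical parent-to-children mapping, `root` is assumed to be the LCA of the hits, and nodes with no hit below them are skipped because the score is undefined there.

```python
def penalty_scores(children, root, hits, q):
    """Return {node: PS} for every node below root that has at least one hit.

    children maps each internal node to its list of children; leaves are
    absent from the mapping (or map to an empty list).
    """
    total_hits = len(hits)          # |M_i|
    m, leaves, scores = {}, {}, {}

    # Pre-order listing; reversing it visits every node after its children.
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))

    for node in reversed(order):
        kids = children.get(node, [])
        if not kids:                                   # leaf
            m[node] = 1 if node in hits else 0
            leaves[node] = 1
        else:                                          # internal node
            m[node] = sum(m[c] for c in kids)
            leaves[node] = sum(leaves[c] for c in kids)
        n = leaves[node] - m[node]                     # |N_ij|
        if m[node] > 0:                                # PS undefined if |M_ij| = 0
            scores[node] = (q * (total_hits - m[node]) + (1 - q) * n) / m[node]
    return scores


# Hypothetical tree: the root has two children, each with four leaves.
children = {
    "root": ["a", "b"],
    "a": ["s1", "s2", "s3", "s4"],
    "b": ["s5", "s6", "s7", "s8"],
}
scores = penalty_scores(children, "root", {"s1", "s5", "s6", "s7"}, q=0.5)
best = min(scores, key=scores.get)
```

Each node is visited once and does work proportional to its number of children, so the total cost is linear in the size of the subtree.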

Classification of Ambiguous Sequences Efficient Algorithms

Theorem

Given a set Mi ⊆ L of hits and the subtree Ti of T rooted at the LCA of Mi , the penalty scores PSi,j for every node j in Ti can be obtained in O(|Mi |) total time after O(|T |) time preprocessing

Proof sketch.

Penalty scores PSi,j only need to be computed for relevant nodes in Ti (those nodes of the topological restriction Ti ||Mi of Ti to Mi ) and there are O(|Mi |) relevant nodes, whose penalty scores can be obtained by a bottom-up traversal of Ti ||Mi

Classification of Ambiguous Sequences Efficient Algorithms

Definition

Any node j in Ti is called relevant if it is a leaf in Mi or the LCA of two or more leaves in Mi

Lemma

For each node j in Ti there exists a relevant node j′ such that PSi,j′ ≤ PSi,j

Proof.

Suppose that j is a node in Ti that is not relevant. Let j′ be the LCA of the leaves in Mi,j. Clearly, j′ is relevant and, furthermore, |Mi,j| = |Mi,j′| while |Ni,j| ≥ |Ni,j′| since Ti,j′ is a subtree of Ti,j. It follows that

\[ PS_{i,j'} - PS_{i,j} = (1 - q) \cdot \frac{|N_{i,j'}|}{|M_{i,j'}|} - (1 - q) \cdot \frac{|N_{i,j}|}{|M_{i,j}|} = (1 - q) \cdot \frac{|N_{i,j'}| - |N_{i,j}|}{|M_{i,j}|} \le 0 \]

Classification of Ambiguous Sequences Resources

• TANGO http://www.lsi.upc.edu/~valiente/tango/

Taxonomic Diversity Diversity and Richness

Diversity
  Relation between number of species and number of individuals in a sample
α-diversity
  Species diversity within an ecosystem
β-diversity
  Change in species diversity between ecosystems
...
ω-diversity
  Phylogenetic difference between species in an ecosystem
Richness
  Number of species in a sample

R. H. Whittaker (1972)

Taxonomic Diversity α Diversity

Shannon-Wiener index
  Information entropy of the distribution of a sample taken from a population of species

\[ H = -\sum_{i=1}^{S} p_i \ln p_i \]

• S is the number of species in the sample
• ni is the abundance of species i
• n is the total number of individuals
• pi = ni/n is the relative abundance of species i

C. E. Shannon (1948)

Taxonomic Diversity ω Diversity

Clarke-Warwick index
  Quadratic information entropy of the distribution of a sample taken from a population of species

\[ Q = \sum_{i=1}^{S} \sum_{j=1}^{S} w_{ij} \, p_i \, p_j \]

• S is the number of species in the sample
• ni is the abundance of species i
• n is the total number of individuals
• pi = ni/n is the relative abundance of species i
• wij is the distance in the taxonomic reference between species i and j

K. R. Clarke, R. M. Warwick (1998)

Taxonomic Diversity

[Figure: example taxonomy with the levels Families, Genera, Species, and Individuals]

• p1 = 5/14, p2 = 2/14, p3 = 3/14, p4 = 1/14, p5 = 3/14
• H = 1.4944
• Q = 3.6939
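Both indices are short to compute from abundance counts. The sketch below reproduces H for the abundances 5, 2, 3, 1, 3 given above. The value of Q depends on the taxonomic distances wij shown in the figure, which are not listed in the text, so the quadratic entropy function is given without reproducing that number; it expects a caller-supplied distance matrix.

```python
import math


def shannon_index(abundances):
    """H = -sum p_i ln p_i over species with non-zero abundance."""
    n = sum(abundances)
    return -sum((a / n) * math.log(a / n) for a in abundances if a > 0)


def quadratic_entropy(abundances, w):
    """Q = sum_i sum_j w[i][j] * p_i * p_j, with w a symmetric distance matrix."""
    n = sum(abundances)
    p = [a / n for a in abundances]
    size = len(p)
    return sum(w[i][j] * p[i] * p[j] for i in range(size) for j in range(size))


abundances = [5, 2, 3, 1, 3]      # counts behind p1..p5 (14 individuals)
h = shannon_index(abundances)     # agrees with H = 1.4944 to four decimals
```

As a sanity check of the second function, two equally abundant species at distance 2 give Q = 2 · (1/2 · 1/2 · 2) = 1.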

Taxonomic Classification in Practice Case Study I

Metagenomic data set
  Roche/454 sequence reads of 16S ribosomal RNA
  • M. L. Sogin, H. G. Morrison, J. A. Huber et al (2006)
  • L. Dethlefsen, S. M. Huse, M. L. Sogin, D. A. Relman (2008)
  • P. J. Turnbaugh, M. Hamady, T. Yatsunenko et al (2009)
  • VAMPS
  • C. Manichanh, J. Reeder, P. Gibert et al (2010)
  • C. Quince, A. Lanzén, T. P. Curtis et al (2009)
  • D. C. Richter, F. Ott, A. F. Auch, R. Schmid, D. H. Huson (2008)
Reference taxonomy
  5,165 near-full-length type cultures of high quality
  • J. R. Cole, Q. Wang, E. Cardenas et al (2009)
Alignment of sequence reads
  GEM, BLAST
Ambiguities
  Sequences with at most 2 mismatches (99% identity for 200 bp)
Assignment of ambiguous sequence reads
  LCA, TANGO
Diversity measures
  Shannon-Wiener

J. C. Clemente, J. Jansson, G. Valiente (2011)

Taxonomic Classification in Practice Case Study I

Ambiguous reads

           GEM                            BLAST
Data Set   No hit   One hit  Ambiguous    No hit  One hit    Ambiguous
Marine     194,015  4,655    23,621       66,493  8,741      147,057
Human      527,727  334,521  91,335       31,881  812,392    109,310
Twins V2   990,094  128,649  776          89      1,068,762  50,668
Twins V6   523,161  199,782  94,999       36,603  648,975    132,364
Chicken    10,442   7,140    4,395        1,548   13,084     7,345
Rat        273,114  27,226   31,509       2       287,971    43,876

Taxonomic Classification in Practice Case Study I

Ambiguous reads

           5,165 type cultures            322,864 sequences
Data Set   No hit   One hit  Ambiguous    No hit  One hit  Ambiguous
FASTA      26,458   642      1,261        19,715  991      7,655
FASTQ      26,045   769      1,547        19,182  1,213    7,966

[Figure: flow of reads between the FASTA and FASTQ outcomes for the 5,165 type cultures]

                  FASTQ
FASTA             No hit   One hit  Ambiguous  Total
No hit            26,045   158      255        26,458
One hit           -        611      31         642
Ambiguous         -        -        1,261      1,261
Total             26,045   769      1,547

Taxonomic Classification in Practice Case Study I

Ambiguous reads (FASTA versus FASTQ)

FASTA
         Leaf                               F-measure                          LCA
Rank     0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1.0
Kingdom  0      0      0      0      0      0      0      0      0      0      3
Phylum   0      0      0      0      0      0      0      0      0      0      0
Class    0      0      0      0      0      0      0      0      0      0      4
Order    0      0      0      0      0      1      1      1      1      2      366
Family   0      0      1      5      20     47     105    204    232    242    345
Genus    0      190    424    455    467    445    561    754    733    727    543
Species  1,261  1,071  836    801    774    768    594    302    295    290    0

FASTQ
         Leaf                               F-measure                          LCA
Rank     0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1.0
Kingdom  0      0      0      0      0      0      0      0      0      0      27
Phylum   0      0      0      0      0      0      0      0      0      0      0
Class    0      0      0      0      0      0      0      0      0      0      15
Order    0      0      0      0      1      2      2      2      3      7      419
Family   0      2      7      19     43     75     136    266    315    347    393
Genus    0      290    529    561    587    568    689    938    902    892    693
Species  1,547  1,255  1,011  967    916    902    720    341    327    301    0

Taxonomic Classification in Practice Case Study II

Metagenomic data set
  Roche/454 and Illumina/Solexa sequence reads of 16S ribosomal RNA
  • D. C. Richter, F. Ott, A. F. Auch, R. Schmid, D. H. Huson (2008)
Reference taxonomy
  5,165 near-full-length type cultures of high quality
  • J. R. Cole, Q. Wang, E. Cardenas et al (2009)
Alignment of sequence reads
  BWA/SW, BLAT and BWA, GEM
Ambiguities
  Sequences with at most 2 mismatches (96% identity for 50 bp) or at most 3 mismatches (96% identity for 75 bp)
Assignment of ambiguous sequence reads
  MOTHUR, RDP Classifier, TANGO
Diversity measures
  Shannon-Wiener, Clarke-Warwick

P. Ribeca, G. Valiente (2011)

Taxonomic Classification in Practice Case Study II

Simulated data sets

Simulation Run  Total Reads  Mean Length  Diversity  Richness  Taxonomic Diversity
454-100         99,784       107.021      8.5200     5,148     11.0149
454-250         99,387       264.767      8.5190     5,148     11.0146
Solexa-50       95,874       50.000       8.5191     5,148     11.0168
Solexa-75       94,567       75.000       8.5186     5,148     11.0172

Taxonomic Classification in Practice Case Study II

Number of simulated reads classified in terms of their number k of mismatches and indels

k           454-100  454-250  Solexa-50  Solexa-75
0           5,255    237      44,287     48,861
1           15,206   946      34,463     32,540
2           22,528   2,271    13,170     10,493
3           22,363   4,636    3,292      2,254
4           16,634   7,776    577        367
5           9,737    11,184   77         44
6           4,834    13,305   8          8
7           2,073    13,994
8           813      13,180
9           252      10,851
10          67       8,129
11          17       5,531
12          5        3,349
13                   1,990
14 or more           2,008

Taxonomic Classification in Practice Case Study II

Aligning simulated Roche/454 reads with BWA/SW and BLAT

Data Set  Tool    Output Size  Aligned Reads  Unaligned Reads
454-100   BWA/SW  18M          35,999         57,329
          BLAT    11G/13M      69,189         24,139
454-250   BWA/SW  39M          54,050         48,444
          BLAT    14G/4M       63,094         39,400

Aligning simulated Illumina/Solexa reads with BWA and GEM

                                All Reads           Up To 3 Mismatches
Data Set   Tool  Output Size    Aligned  Unaligned  Aligned  Unaligned
Solexa-50  BWA   231M           89,421   6,453      89,416   265
           GEM   511M           89,681   6,193      89,681   0
Solexa-75  BWA   87M            92,042   2,525      92,039   280
           GEM   284M           92,319   2,248      92,319   0

Taxonomic Classification in Practice Case Study II

Taxonomic assignment with RDP Classifier

           T = 100%             T = 90%              T = 80%              T = 70%
Data Set   Diversity  Richness  Diversity  Richness  Diversity  Richness  Diversity  Richness
454-100    5.0099     1,449     4.4206     1,422     3.4922     1,373     1.5970     952
454-250    6.2334     1,444     6.0476     1,426     5.7182     1,407     4.1619     1,357
Solexa-50  3.9299     1,375     3.2198     1,302     2.3419     1,146     1.2636     534
Solexa-75  5.2719     1,446     4.8026     1,406     4.0484     1,341     2.1618     970

T is the confidence threshold

Taxonomic Classification in Practice Case Study II

Taxonomic assignment with MOTHUR

           k = 7                k = 8                k = 9                k = 10
Data Set   Diversity  Richness  Diversity  Richness  Diversity  Richness  Diversity  Richness
454-100    3.0214     649       3.0566     666       3.0695     653       3.0465     649
454-250    3.4070     662       3.4428     664       3.4538     678       3.4505     680
Solexa-50  2.6129     642       2.6448     633       2.6440     637       2.6304     637
Solexa-75  2.9011     628       2.9195     642       2.9286     643       2.9204     653

Taxonomic Classification in Practice Case Study II

Taxonomic assignment with TANGO

           BWA                                       GEM
Data Set   Diversity  Richness  Taxonomic Diversity  Diversity  Richness  Taxonomic Diversity
Solexa-50  6.2114     3,595     8.0441               6.5149     3,739     8.3293
Solexa-75  6.4927     3,625     8.6793               6.7536     3,623     8.9289

Taxonomic Classification in Practice Case Study III

Metagenomic data set
  Sanger sequence reads selected at random from 113 microbial genomes
  • K. Mavromatis, N. Ivanova, K. Barry et al (2007)
Reference taxonomy
  NCBI taxonomy for the 113 sampled microbial genomes
  • E. W. Sayers, T. Barrett, D. A. Benson et al (2009)
Alignment of sequence reads
  BLAST
Ambiguities
  Sequences with same E-value as the top BLAST hit
Assignment of ambiguous sequence reads
  LCA, TANGO
Diversity measures
  Clarke-Warwick

D. Alonso, J. C. Clemente, J. Jansson, G. Valiente (2011)

Taxonomic Classification in Practice Case Study III

Phylogenetic distribution of the 113 microbial genomes

Kingdom   Phylum               Class                        Genomes
Bacteria  Actinobacteria       Actinobacteria               9
          Bacteroidetes        Cytophagia                   1
          Chlorobi             Chlorobia                    7
          Chloroflexi          Chloroflexi                  1
          Cyanobacteria        Cyanobacteria                6
          Deinococcus-Thermus  Deinococci                   1
          Firmicutes           Bacilli                      13
                               Clostridia                   8
          Proteobacteria       Alphaproteobacteria          17
                               Betaproteobacteria           13
                               Gammaproteobacteria          25
                               Deltaproteobacteria          6
                               Epsilonproteobacteria        1
                               unclassified Proteobacteria  1
Archaea   Euryarchaeota        Methanomicrobia              3
                               Thermoplasmata               1

Taxonomic Classification in Practice Case Study III

Distribution of sequence reads in the metagenomic data set

               simLC   simMC   simHC
Most abundant  28,860  22,954  2,384
2nd            9,277   16,576  2,248
3rd            5,168   10,483  2,191
4th            1,149   6,107   2,127
5th            1,109   4,850   2,083
6th            1,074   1,146   2,051
Rest           50,798  52,263  103,586

simLC
  Low-complexity microbial communities with one dominant population
simMC
  Medium-complexity microbial communities with more than one dominant population flanked by low-abundance populations
simHC
  High-complexity microbial communities with no dominant population

Taxonomic Classification in Practice Case Study III

Ambiguous sequence reads in the metagenomic data set

Data set  No hit  One hit  Ambiguous  Total
simLC     1       76,512   20,923     97,436
simMC     3       86,704   27,675     114,382
simHC     2       99,618   17,052     116,672

Taxonomic Classification in Practice Case Study III

Taxonomic distribution of the metagenomic data set using consensus (LCA) taxonomic assignment

Data set  Kingdom  Phylum  Class  Order  Family  Genus   Species
simLC     19       104     134    56     2,785   5,295   88,935
simMC     24       176     174    101    2,784   5,219   105,731
simHC     39       219     230    111    822     11,164  103,852

Taxonomic distribution of the metagenomic data set using optimal (F-measure) taxonomic assignment

Data set Kingdom Phylum Class Order Family Genus Species
simLC 1 65 46 1,236 3,241 92,846
simMC 10 90 104 1,179 3,191 109,805
simHC 12 145 77 414 6,847 109,175

Taxonomic Classification in Practice Case Study III

Taxonomic diversity (Clarke-Warwick index) of the metagenomic data set using consensus (LCA) and optimal (F-measure) taxonomic assignment

Data set  LCA     F-measure  Actual
simLC     6.6874  6.8651     7.0786
simMC     6.2634  6.4367     6.6968
simHC     7.5488  7.6748     7.9361

Outlook

• Taxonomic assignment using multiple reference taxonomies
• Taxonomic assignment using reference phylogenetic networks

Acknowledgment

Daniel Alonso, José C. Clemente, Jesper Jansson, Paolo Ribeca
