Lecture 6: Introduction to Phylogenetics

Introduction to Bioinformatics for Computer Scientists
Lecture 6

Machine Learning for MSA
● Few approaches
● None are widely used
● Problem: What is the training set?
● However, used for aligning natural languages for automatic paraphrasing
  → http://dl.acm.org/citation.cfm?id=1073448
● Only paper I found: “Support vector machine learning for interdependent and structured output spaces”, but it is more a general technique than one that is specifically successful in MSA
  → http://dl.acm.org/citation.cfm?id=1015341
● Machine learning and similar techniques are mostly used for protein structure prediction based on given alignments
● Many post-analysis applications use machine learning to identify “interesting” regions in alignments

Plan for next lectures
● Today: Introduction to phylogenetics
● Lecture 7: Phylogenetic search algorithms
● Lecture 8 (Alexis): The phylogenetic Maximum Likelihood Model
● Lecture 9 (Diego): Comparing & Testing Models
● Lecture 10 (Lucas): Discrete Operations on Trees

The story so far
● Biological Terminology: RNA, DNA, genes, genomes, etc.
● Pair-wise Sequence Alignment
● Sequence Comparison
● Genome Assembly
● Multiple Sequence Alignment
● Phylogenetic Inference

A Phylogeny or Phylogenetic Tree
[Figure: a phylogenetic tree; labels: "A taxonomic subclass", "The ingroup", "The outgroup", "This tree is unrooted"]
● In phylogenetics such a subtree is often also called a lineage!

Phylogeny
● An unrooted, strictly binary tree
● Leaves are labeled by extant ("surviving", i.e., currently living) organisms represented by their DNA/protein sequences
  → we can also sequence ancient DNA, see, for instance, the Neanderthal genome: “The complete genome sequence of a Neanderthal from the Altai Mountains”, Nature 2013
  → how far back this works depends on temperature, time, and other environmental conditions
  → up to 300,000 years back, see http://www.pnas.org/content/110/39/15758.abstract
● Inner nodes represent hypothetical common ancestors
● Outgroup: one or more closely related, but different species
  → allows us to root the tree

Taxon
● Used to denote clades/subtrees in phylogenies or taxonomies
● A group of one or more species that form a biological unit
● As defined by taxonomists
  → subject of controversial debates
  → part of the culture/fuzziness of biology
● In phylogenetics we often refer to a single leaf as a taxon
  → the plural of taxon is taxa
  → we often say that a tree with n leaves (sequences) has n taxa

Some more terminology
[Figure: a rooted phylogeny with leaves D, E, A, B, C] This phylogeny has a root!
● B and C are a monophyletic group; they are sister species
● (A,B,C) is a monophyletic group; it is sister to (D,E)
● (A,B,C,D) is paraphyletic → E is excluded
● (A,D) is a polyphyletic group → their most recent common ancestor (MRCA) is excluded

Some more terminology
[Figure: the same tree with leaves D, E, A, B, C, annotated with branch lengths 0.1, 0.1, 0.05, 0.05, 0.1, 0.05, 0.05, 0.05]
● Tree-based or patristic distance between two taxa: the sum of the branch lengths along the path that connects them in the tree, e.g.:
  → A ↔ B: 0.2
  → A ↔ D: 0.35

Tree Rooting
[Figure sequence: an unrooted tree with Ingroup species 1, 2, 3 and Outgroup species 1, 2 is "pulled up" and redrawn with a root]
● This is just a drawing option!
● Tree inference algorithms treat ingroup and outgroup sequences mathematically in the same way!

Outgroup Choice
[Figure: two trees over Ingroup species 1 to 4, one with a close and one with a distant outgroup; labels: "Close Outgroup", "Distant Outgroup", "Clear signal", "Fuzzy signal", "?"]

Tree Inference
Unaligned sequences → MSA program → aligned sequences → tree inference program → tree

Taxon 1: ACGTTT      Taxon 1: ACGTTT-
Taxon 2: ACGTT       Taxon 2: ACGTT--
Taxon 3: ACCCT       Taxon 3: ACCCT--
Taxon 4: AGGGTTT     Taxon 4: AGGGTTT

[Figure: the resulting unrooted tree over Taxon 1, Taxon 2, Taxon 3, Taxon 4]
● Most widely-used alignment formats: PHYLIP, NEXUS, FASTA
● Most widely-used tree formats: NEWICK, NEXUS
● Newick example: Remember that this is an unrooted tree!
  (Taxon1, Taxon2, (Taxon3,Taxon4)); or ((Taxon1, Taxon2), Taxon3,Taxon4);
  → top-level trifurcation
● Rooted tree (top-level bifurcation at the root): ((Taxon1, Taxon2), (Taxon3,Taxon4));
● Trees may have relative branch lengths, depending on the tree inference method that was used
  [Figure: the 4-taxon tree with branch lengths 0.1 (Taxon 1), 0.2 (Taxon 2), 0.15 (Taxon 3), 0.15 (Taxon 4), and 0.3 on the inner branch]
● Newick format with branch lengths: (Taxon1:0.1,Taxon2:0.2,(Taxon3:0.15,Taxon4:0.15):0.3);

Tree Shapes
[Figure: two rooted trees over t1, t2, t3, t4 drawn along an "Evolutionary time" axis; panels: "Ultrametric tree", "Non-ultrametric tree"]
● Most tree inference models/algorithms/programs produce non-ultrametric trees
● This is still relative time, for instance measured as the mean number of substitutions per site
● How do we get real times?
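The Newick string with branch lengths and the ultrametric/non-ultrametric distinction from the slides above can be explored with a small script. The following is only a minimal sketch: the parser handles just the simple Newick subset used on the slides (names, optional branch lengths, no quoting or comments), the function names `parse_newick`, `patristic`, and `is_ultrametric` are made up for illustration, and the top-level node of the Newick string is treated as the root when measuring root-to-tip distances, which is an assumption for an unrooted tree. The second tree in the usage part is a hypothetical example, not one from the lecture.

```python
import math


def parse_newick(text):
    """Parse a simple Newick string into nested (name, length, children) tuples."""
    s = "".join(text.split())  # drop all whitespace
    if not s.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    pos = 0

    def node():
        nonlocal pos
        children = []
        if s[pos] == "(":
            pos += 1  # skip "("
            children.append(node())
            while s[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1  # skip ")"
        start = pos
        while s[pos] not in ",():;":
            pos += 1
        name = s[start:pos]
        length = 0.0  # assumption: a missing branch length counts as 0
        if s[pos] == ":":
            pos += 1
            start = pos
            while s[pos] not in ",();":
                pos += 1
            length = float(s[start:pos])
        return (name, length, children)

    return node()


def leaf_paths(node, path=(), lengths=()):
    """Map each leaf name to (child-index path from the top node, branch lengths along it)."""
    name, _, children = node
    if not children:
        return {name: (path, lengths)}
    result = {}
    for i, child in enumerate(children):
        result.update(leaf_paths(child, path + (i,), lengths + (child[1],)))
    return result


def patristic(tree, a, b):
    """Sum of branch lengths along the path between leaves a and b."""
    paths = leaf_paths(tree)
    (pa, la), (pb, lb) = paths[a], paths[b]
    shared = 0  # number of branches both paths have in common
    while shared < min(len(pa), len(pb)) and pa[shared] == pb[shared]:
        shared += 1
    return sum(la[shared:]) + sum(lb[shared:])


def is_ultrametric(tree, tol=1e-9):
    """True if all leaves are equally far from the top node (treated as the root)."""
    depths = [sum(lengths) for _, lengths in leaf_paths(tree).values()]
    return max(depths) - min(depths) <= tol


slide_tree = parse_newick("(Taxon1:0.1,Taxon2:0.2,(Taxon3:0.15,Taxon4:0.15):0.3);")
print(round(patristic(slide_tree, "Taxon1", "Taxon3"), 6))
print(is_ultrametric(slide_tree))

# Hypothetical rooted tree in which every tip is 3 units away from the root.
clock_tree = parse_newick("((t1:1,t2:1):2,(t3:2,t4:2):1);")
print(is_ultrametric(clock_tree))
```

Patristic distances do not depend on where the tree is rooted, whereas the ultrametricity check only makes sense once a root has been chosen.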

Dating Trees
[Figure: an ultrametric tree over t1, t2, t3, t4 along an "Evolutionary time" axis, calibrated with a dated fossil (2 million years); labels: 1 million years, 2 million years, Root: 3 million years]
● We need a rooted & ultrametric tree!
  → rooting with outgroups
  → ultrametricity with programs for divergence time estimation
  → active research area
  → most codes rely on the phylogenetic likelihood function and Bayesian statistics (MCMC methods)

Dating Trees
● But how do we place the fossil?
  → typically no DNA data available
● Fossil placement:
  → ad hoc using empirical knowledge
  → computationally using morphological data
● The input for a phylogenetic analysis need not be molecular data!
● We can also use sequences of morphological traits!
● Remember that we deal with extant species! (figure label: 2012)

Morphological Traits
● t1: 1000, t2: 0100, t3: 0010, t4: 0001
● or: t1: 0, t2: 1, t3: 2, t4: 3
● Traits need not be discrete; they can also be continuous, e.g., bone ratios

Alignment-Free Tree Inference
Unaligned sequences → pair-wise distances (e.g., pair-wise sequence alignment scores) → tree inference program → tree

Taxon 1: ACGTTT
Taxon 2: ACGTT
Taxon 3: ACCCT
Taxon 4: AGGGTTT

[Figure: the resulting unrooted tree over Taxon 1, Taxon 2, Taxon 3, Taxon 4]
● Alignment-free tree inference is typically less accurate
  → we have not established homology via an MSA

How many unrooted 4-taxon trees exist?
[Figure: the three possible unrooted trees over A, B, C, D]

How many rooted 4-taxon trees exist?
[Figure: the three possible unrooted trees over A, B, C, D]

Tree Counts
● Unrooted binary trees
  ● 4 taxa → 3 distinct trees
  ● A tree with n taxa has n-2 inner nodes
  ● and 2n-3 branches
● Rooted binary trees
  ● 4 taxa → 3 unrooted trees * 5 branches each (rooting points) = 15 trees
  ● n-1 inner nodes
  ● 2n-2 branches

The number of trees
● 3 taxa: 1 tree
● 4 taxa: 3 trees
  → u: # trees of size 4-1 := 1
  → v: # branches in a tree of size 4-1 := 3
  → number of unrooted binary trees with 4 taxa: u * v = 3
● 5 taxa: 15 trees
  → u = 3, v = 5
  → number of unrooted trees with 5 taxa: 3 * 5 = 15
● 6 taxa: 105 trees
  → u = 15, v =
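
The counting argument on the slides above (the number of unrooted trees with n taxa is u * v, where u is the number of trees with n-1 taxa and v is the number of branches, 2(n-1)-3, in such a tree) can be written as a short loop. This is a minimal sketch; the function names `num_unrooted_trees` and `num_rooted_trees` are chosen here for illustration only.

```python
def num_unrooted_trees(n: int) -> int:
    """Number of distinct unrooted binary trees with n labeled taxa (n >= 3)."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    count = 1  # 3 taxa: exactly one unrooted tree
    for k in range(4, n + 1):
        branches = 2 * (k - 1) - 3  # v: branches in a tree with k-1 taxa
        count *= branches           # u * v: insert taxon k into any branch
    return count


def num_rooted_trees(n: int) -> int:
    """Each unrooted tree can be rooted on any of its 2n-3 branches."""
    return num_unrooted_trees(n) * (2 * n - 3)


for n in range(3, 7):
    print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

For 4 taxa this reproduces the 3 unrooted and 15 rooted trees from the slides; the rapid growth of these numbers is what makes exhaustive tree search infeasible for larger n.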
