Phylogeny Systematics & Taxonomy
Total Page:16
File Type:pdf, Size:1020Kb
Phylogeny Systematics & Taxonomy Wayne Maddison 25 March 2003 Beginning 1750’s: • History + basic ideas What are the species and how are they related? • Methods for reconstructing phylogeny • Applications of phylogeny Description of species & By ca. 1950: arrangement • hundreds of into groups thousands of (classification) species described • general idea of phylogeny, at least for multicellular creatures (2) Phylogeny & classification were Systematics of mid 20th century: permitted to differ (1) Focus was not so much classification & Phylogeny: Classification: phylogeny; focus was distinguishing species. Reptiles Phylogeny inferred usually without explicit data or turtles mammalscrocodilesbirds lizards turtles analyses (i.e. seat of the pants) lizards crocodiles Birds Mammals Monophyletic group: Paraphyletic group: an ancestor and all of its descendants an ancestor and some but not all of its descendants or a group consisting of species more closely related to Must be contiguous piece of tree (i.e. Archaeopteryx + each other than to any other species not in the group ostrich + crow isn’t paraphyletic) or a group whose most recent common ancestor is more recent than any common ancestor shared with species outside the group Polyphyletic group: neither monophyletic nor paraphyletic 1950’s & 60’s formalizing methods begins i.e. discontiguous pieces of the phylogenetic tree Hennig 1950 (German), 1966 (English) (1) Formal logic for reconstructing phylogeny (2) Classification should match phylogeny (all groups monophyletic) Before Hennig: If a group of species is closely related, they should apomorphy: derived trait share traits synapomorphy: shared derived trait Hennig: plesiomorphy: ancestral trait If a group of species is monophyletic, they should plesiomorphy apomorphy share traits derived within the containing group. Other types of groups (paraphyletic, polyphyletic) are not expected to possess derived traits uniquely Therefore sharing of derived traits is the indicator of monophyly. Synapomorphy indicates monophyly ABCDE FG 1960’s: quantification Papilio 0 2 1 0 0 1 0 Nymphalis 0 1 0 1 1 0 1 Distance methods Pieris 1 000002 Formal coding of data into matrix of characters Danaus 0 1 0 1 1 0 1 Battus 0 2 1 0 0 1 0 and character states Heliconius 0 0 0 0 1 1 0 Colias 1 0 0 0 0 0 2 Character with states 0, 1 Data matrix some loss of information Tree ABCDE FG Distance matrix Papilio 0210010 Papilio Nymphalis Pieris Danaus Battus Heliconius Colias Nymphalis 0101101 Papilio 0 Pieris 1000002 Nymphalis 0.5 0 Danaus 0101101 Pieris 0.6 0.5 0 Battus 0210010 Danaus 0.5 0.4 0.4 0 Heliconius 0000110 Battus 0.2 0.6 0.7 0.6 0 Colias 1000002 Heliconius 0.9 0.8 0.8 0.8 0.9 0 Colias 0.7 0.5 0.3 0.5 0.7 0.8 0 ABCDE FG Papilio 0 2 1 0 0 1 0 Distance methods Nymphalis 0 1 0 1 1 0 1 Optimality methods Pieris 1 000002 UPGMA Danaus 0 1 0 1 1 0 1 Battus 0 2 1 0 0 1 0 Examine trees, Neighbor Joining Heliconius 0 0 0 0 1 1 0 Colias 1 0 0 0 0 0 2 choose tree that Data matrix optimizes some Optimality methods ? criterion Parsimony ? Tree 5 ? Tree 4 — seeks tree minimizing evolutionary change ? ? Tree 3 Tree 2 Likelihood — seeks tree maximizing probability Tree 1 of observed data Likelihood: the probability of the data observed given the hypothesis and assumptions P(Data | Parsimony Hypothesis) — counting steps, examples Goal: to find that hypothesis that maximizes the — seeks “simplest explanation” (minimizes ad hoc probability hypothesis against contradictory evidence) A Observed Data B A ATTGTA C B ATTGTA C ACCGCA D D ACCGCA If this were the tree ... what would be the probability of these sequences evolving? Likelihood Probability of evolving sequences doesn't depend simple example only on tree A Observed Data B A ATTGTA C B ATTGTA What is ABCD C ACCGCA D D ACCGCA probability of 1100 these states Also depends on: evolving if -the ancestral sequence, or the probabilities of various possible ancestral sequences probability of rates of mutation per unit time change on each -the probabilities of mutations per unit time branch is 0.1? -the times involved (branch lengths) Model of sequence evolution: Example of likelihood analysis using JC 69 -probabilities of bases at ancestor -base rates of mutations per unit time A A C G T For each candidate tree, -site to site rate variation A - α α α Rates of mutation per unit time B probability of sequences evolving will depend on C α - α α Rates of change C Simplest: branch lengths and the α G α α - α Jukes-Cantor 1969 A C G T D parameter of the JC model T α α α - A,C,G,T equally A - α α α probableAll substitutions at ancestor C α - α α Search: Try alternative trees, branch lengths and ANDequally all likely rates equal α G α α - α 's to find combination maximizing probability of observing the data T α α α - — simultaneous estimation More complex, realistic models More complex, realistic models GTR (General HKY 85 time reversible) A C G T A C G T transitions and all changes can A - αfC βfG γfT transversions can A - αfC βfG αfT differ in rate as differ in rate long as C αfA - δfG εfT C αfA - αfG βfT symmetrical G βfA δfC - ρfT equilibrium base G βfA αfC - αfT frequencies equilibrium base T γfA εfC ρfG - T αfA βfC αfG - might not be frequencies equal might not be equal One model of site-to-site rate variation: Another issue: not all sites evolve at the same rate The gamma distribution - has one parameter, the "shape" Protein coding: 2nd positions much slower, third positions and introns fastest Non-coding: e.g., ribosomal or tRNA: areas not α = 200 α = 0.5 vital to secondary structure or interactions may frequency of sites evolve quickly with that rate α = 2 α = 50 slow fast rate of evolution Example of likelihood analysis using GTR + gamma Better likelihoods with more complex model A C G T A fewer parameters A - αfC βfG γfT B gamma rate tree + JC (rate) + C αfA - δfG εfT C variation δf ρf G βfA C - T tree + HKY (transition & transversion ρf D T γfA εfC G - rates + equil. freq.) Search: Try alternative trees, branch lengths tree + GTR (6 rates + equil. freq.) + substitution rate parameters, equilibrium base gamma rate variation more parameters frequences and gamma shape parameters to find combination maximizing probability of observing the data but should use model only as complex as you need! — simultaneous estimation of model & tree Can use likelihood ratio tests to test significance Practical difficulties: computation Parsimony — originally justified as “simplicity of explanation” Searching among all possible Likelihood -trees — statistically best justified, but imposes assumptions -branch lengths of uniformity of process across characters & branches -rate matrix parameters -equilibrium base frequencies of tree -gamma shape parameters both make explicit predictions about character to find the combination maximizing probability distributions, and assess disagreement between is not easy! predicted & observed character distributions Difficulties of searching for optimal trees Can’t search exhaustively among all number of taxa* number of possible trees trees 33 415 5 105 Heuristic search (“good guess”) 6 945 7 10395 10 3.4 e7 — Build initial tree by adding taxa 20 8.2 e21 30 4.95 e38 — Hill-climbing algorithm making adjustments to 1,000,000 1 e5,866,723 tree *terminal taxa, OTU’s at each step check optimality criterion to see what The age of the universe is about 5 e29 adjustment to make picoseconds Success? Distance methods (popular 60’s & 70’s; Neighbor Joining brought back in 90’s) Optimality methods Parsimony (popular 80’s & 90’s; remains popular with morphological data) Likelihood (popular 90’s & 00’s) Trees from four separate gene regions Tree from visible structures & color Jumping spiders: different data agree.