Systematics - BIO 615

Outline

1. A few more data issues (weighting, taxon sampling)

2. Tree terminology

3. Hennig’s method, polarity

4. Optimality criteria: parsimony

Derek S. Sikes University of Alaska

Four steps - each should be explained in methods

1. Character (data) selection (not too fast, not too slow): “Why did you choose these data?”

2. Alignment of data (hypotheses of primary homology): “How did you align your data?”

3. Analysis selection (choose the best model / method(s)) - data exploration: “Why did you choose your analysis method?”

4. Conduct analysis

Saturation

As time proceeds DNA distances also increase, to a point of saturation.

In extreme cases the data become essentially randomized & all phylogenetically valuable information is gone.

[Saturation graph: actual vs. observed DNA distances plotted against time]

20 Amino Acids & stop codon

3-letter  1-letter  Full name      mRNA nucleotide triplets (codons)
Ala       A         Alanine        GCA, GCC, GCG, GCU
Arg       R         Arginine       AGA, AGG, CGA, CGC, CGG, CGU
Asn       N         Asparagine     AAC, AAU
Asp       D         Aspartic acid  GAC, GAU
Cys       C         Cysteine       UGC, UGU
Glu       E         Glutamic acid  GAA, GAG
Gln       Q         Glutamine      CAA, CAG
Gly       G         Glycine        GGA, GGC, GGG, GGU
His       H         Histidine      CAC, CAU
Ile       I         Isoleucine     AUA, AUC, AUU
Leu       L         Leucine        UUA, UUG, CUA, CUC, CUG, CUU
Lys       K         Lysine         AAA, AAG
Met       M         Methionine     AUG
Phe       F         Phenylalanine  UUC, UUU
Pro       P         Proline        CCA, CCC, CCG, CCU
Ser       S         Serine         AGC, AGU, UCA, UCC, UCG, UCU
Thr       T         Threonine      ACA, ACC, ACG, ACU
Trp       W         Tryptophan     UGG
Tyr       Y         Tyrosine       UAC, UAU
Val       V         Valine         GUA, GUC, GUG, GUU
STOP      -         -              UAA, UAG, UGA

Recall the 3rd codon position changes the most and the 2nd position changes the least.

Saturation

One way to help correct saturated data is to ignore the 3rd position sites, which will saturate faster than the 1st or 2nd.

Another method is to translate the DNA to amino acid sequences and use those instead.

[Saturation graph: actual vs. observed DNA distances plotted against time]
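The two corrections above (ignoring 3rd positions, or translating to amino acids) can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, the sequence is assumed to start in reading frame, and the codon table is the one from the slide (mRNA alphabet, so T is converted to U first).

```python
# Codon table from the slide: one-letter amino acid code -> mRNA codons ("*" = stop).
AA_TO_CODONS = {
    "A": "GCA GCC GCG GCU", "R": "AGA AGG CGA CGC CGG CGU",
    "N": "AAC AAU", "D": "GAC GAU", "C": "UGC UGU",
    "E": "GAA GAG", "Q": "CAA CAG", "G": "GGA GGC GGG GGU",
    "H": "CAC CAU", "I": "AUA AUC AUU",
    "L": "UUA UUG CUA CUC CUG CUU", "K": "AAA AAG", "M": "AUG",
    "F": "UUC UUU", "P": "CCA CCC CCG CCU",
    "S": "AGC AGU UCA UCC UCG UCU", "T": "ACA ACC ACG ACU",
    "W": "UGG", "Y": "UAC UAU", "V": "GUA GUC GUG GUU",
    "*": "UAA UAG UGA",
}
# Invert to codon -> amino acid.
CODON_TO_AA = {
    codon: aa for aa, codons in AA_TO_CODONS.items() for codon in codons.split()
}


def drop_third_positions(seq):
    """Keep codon positions 1 and 2 only (assumes seq starts in frame)."""
    return "".join(base for i, base in enumerate(seq) if i % 3 != 2)


def translate(seq):
    """Translate an in-frame DNA or mRNA sequence; unknown codons become 'X'."""
    rna = seq.upper().replace("T", "U")
    usable = len(rna) - len(rna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TO_AA.get(rna[i:i + 3], "X") for i in range(0, usable, 3))


if __name__ == "__main__":
    a = "ATGGCATTT"
    b = "ATGGCGTTC"  # differs from a only at 3rd codon positions
    print(drop_third_positions(a), drop_third_positions(b))
    print(translate(a), translate(b))
```

In this toy pair the two sequences differ at two sites, yet both corrections make them identical, which is exactly the fast-evolving signal (and noise) that these methods discard.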


[Figure: DNA structure]

Saturation - Transitions & Transversions

Recall: A, G = purines; T, C = pyrimidines

Most mutations are transitions [Ts] (replacement of a purine with a purine, or a pyrimidine with a pyrimidine)

Transversions [Tv] (replacement of a purine with a pyrimidine, or a pyrimidine with a purine) are less common
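As a small illustration of the definitions above, the sketch below classifies each observed difference between two aligned sequences as a transition or a transversion. The function names are hypothetical, and note that it counts observed differences between two sequences, not the actual number of mutations (which saturation hides).

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(a, b):
    """Return 'Ts', 'Tv', or None (no difference) for two nucleotides."""
    if a == b:
        return None
    pair = {a, b}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "Ts"  # purine <-> purine or pyrimidine <-> pyrimidine
    return "Tv"      # purine <-> pyrimidine


def count_ts_tv(seq1, seq2):
    """Count observed transitions and transversions between aligned sequences."""
    ts = tv = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        kind = classify(a, b)
        if kind == "Ts":
            ts += 1
        elif kind == "Tv":
            tv += 1
    return ts, tv


if __name__ == "__main__":
    print(count_ts_tv("AACCGGTT", "GACTGCTA"))
```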

Saturation

Another option to deal with saturation (a method of correcting the data):

Convert DNA data to purines & pyrimidines
- in matrix use R for purines, Y for pyrimidines

Removes all the transitions from the data & leaves only transversions

Less extreme than using only amino acids (which may ignore a lot of signal in the DNA data)

Saturation

This is essentially an extreme form of weighting: transitions are weighted = 0 and transversions are weighted = 1

Some opt for less extreme weighting: e.g. transitions = 1, transversions = 2

The logic is that the less common evolutionary change should be less homoplasious & have more phylogenetic information (signal)
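The link between RY-coding and weighting can be checked directly. The sketch below is a toy comparison of two aligned sequences, not a tree-based analysis: counting differences after RY-coding gives the same number as weighting transitions 0 and transversions 1. Function names and the default weights are assumptions for illustration.

```python
RY = {"A": "R", "G": "R", "C": "Y", "T": "Y"}


def ry_code(seq):
    """Recode a DNA sequence as purines (R) and pyrimidines (Y); others become '?'."""
    return "".join(RY.get(base, "?") for base in seq.upper())


def weighted_differences(seq1, seq2, ts_weight=1, tv_weight=2):
    """Sum the weights of observed differences between two aligned sequences."""
    total = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in RY or b not in RY or a == b:
            continue  # skip gaps, ambiguity codes, and identical sites
        total += ts_weight if RY[a] == RY[b] else tv_weight
    return total


if __name__ == "__main__":
    s1, s2 = "AACCGGTT", "GACTGCTA"
    print(ry_code(s1), ry_code(s2))
    print(weighted_differences(s1, s2))                             # Ts = 1, Tv = 2
    print(weighted_differences(s1, s2, ts_weight=0, tv_weight=1))   # same as RY-coding
```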

Weighting

Some (mostly cladists) consider weighting (or correcting) of data in any way to be wrong

They argue that the subjectively decided weighting scheme may alter the results

In which case the results are not due to the signal in the data but due to the weighting scheme itself

(but there is no such thing as “unweighted” data - this would be “equally weighted” data)

A big issue that we will touch on repeatedly

Sample Size

For morphological data one typically examines tens to hundreds of specimens when scoring characters

For molecular data such sample sizes are much more costly
- typically people sequence one or, preferably, a few specimens per taxon of interest


Sample Size

Problems with one sequence per taxon:

- Much harder to identify cases of gene tree problems, xenology, hybridization, or introgression

- Also no way to check for lab error:
  - could have been a misidentification
  - or a tube mix-up

Sample Size

Lab / human error:

1. Three major parasitoid wasp taxa - one of the first molecular phylogenies. Agreed with morphology. Later discovered the sequences were those of cow, pig, and human! Got the “correct” tree by chance. Apocryphal? Story reported by Godfray & Knapp (2004)

2. Odd, published, phylogeny of mammals that suggested the blue whale was a primate - due to cut & paste errors

Godfray, H. C. J. & Knapp, S. (2004) Introduction to a theme issue ‘Taxonomy for the twenty first century’. Phil. Trans. R. Soc. Lond. B 359: 559-569.

Page, R. D. M. & Charleston, M. A. (1999) Comments on Allard & Carpenter (1996), or the ‘aquatic ape’ hypothesis revisited. Cladistics 15: 73–74.

[Figure: imaginary example of possible misidentification or introgression etc.; one specimen, interruptus3, falls outside its expected group]

If dataset had only 1 specimen per taxon this problem would not be apparent

Amazing that Allard & Carpenter published this tree

Less obvious mistakes might be scattered among the literature

So, when possible, verify the work of others…

Sample Size

Rule of thumb: should use multiple samples one rank below the taxon of interest

e.g.
- if studying relationships among species then sample multiple specimens per species
- if studying relationships among genera then sample multiple species per genus etc.

Phylogenetic Trees - Terminology

Phylogenetic trees are hypotheses or estimates of evolutionary relationships
- trees are an explanation of the data (how the data came to be the way they are)

If we assume:
1. Our characters are independent
2. Our character states are homologous (& genes orthologous)
3. Evolution has happened

We can infer the evolutionary relationships among organisms


Phylogenetic Trees - Terminology

Phylogenetic inference:

Inference: drawing conclusions from premises

Deduction: if premises true then inference is true

Induction: if premises true then inference is probably true

In phylogenetic inference we can never do better than “probably” true

Phylogenetic Trees - Terminology

[Tree diagram with tips A B C D E F G H I J, labeled with the following terms]
- Terminals, leaves, tips or OTUs
- Peripheral or terminal branches
- Interior branches
- Nodes (node 1, node 2)
- Polytomy or multifurcation
- Root

Newick tree format: (((A,B),C),(((H,I,J),(F,G)),(D,E)));
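To make the Newick format concrete, here is a minimal sketch of a parser for topology-only Newick strings (no branch lengths, no internal node labels). The example string is an assumption: it is the slide's tree written in strict Newick syntax, with commas between sister groups and a closing semicolon. The function names are made up for this example; real analyses would use an established phylogenetics library instead.

```python
def parse_newick(text):
    """Parse a topology-only Newick string into nested tuples of tip names."""
    text = text.strip().rstrip(";")
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else ""

    def parse_node():
        nonlocal pos
        if peek() == "(":
            pos += 1  # consume "("
            children = [parse_node()]
            while peek() == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if peek() != ")":
                raise ValueError("unbalanced parentheses at position %d" % pos)
            pos += 1  # consume ")"
            return tuple(children)
        start = pos
        while peek() and peek() not in "(),":
            pos += 1
        return text[start:pos]

    tree = parse_node()
    if pos != len(text):
        raise ValueError("unexpected text after position %d" % pos)
    return tree


def tips(tree):
    """List the terminals (leaves, tips, OTUs) from left to right."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in tips(child)]


def polytomies(tree):
    """Return every node with more than two descendant branches."""
    if isinstance(tree, str):
        return []
    found = [tree] if len(tree) > 2 else []
    for child in tree:
        found.extend(polytomies(child))
    return found


if __name__ == "__main__":
    tree = parse_newick("(((A,B),C),(((H,I,J),(F,G)),(D,E)));")
    print(tips(tree))
    print(polytomies(tree))
```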

Phylogenetic Trees - Terminology

Phylogram

Branch lengths proportional to time and / or change

[Tree diagram: tips A B C D E H I J F G; axis: absolute time or divergence]

Phylogenetic Trees - Terminology

Cladogram

Branch lengths meaningless; relationships only

These are all the same trees: (((D,C),B),A);

[Tree diagrams: the same cladogram drawn several ways]

Phylogenetic Trees - Terminology

Dendrogram or Chronogram

Branch lengths proportional to time

Require some estimate of a …

aka Ultrametric trees

[Tree diagram: tips A B C D E H I J F G; axis: absolute time or divergence]

Phylogenetic Trees - Terminology

Trees - Rooted and Unrooted

[Figure: an unrooted tree of archaea, eukaryotes and bacteria, and the same tree rooted by outgroup (bacteria); in the rooted tree the archaea and the eukaryotes each form a monophyletic group]


Phylogenetic Trees - Terminology

Trees - Rooted and Unrooted

Unrooted tree - a network that depicts the relationships among the OTUs but not the direction of evolution
- most methods / programs build unrooted trees
- these can be later rooted to establish the direction of evolution (and the polarity of the character states)

Rooted tree - there are various ways to root a tree
- most common is by addition of an outgroup
- where the outgroup joins the ingroup is the root!

[Figure: an unrooted tree and the same tree rooted using Nori]

Phylogenetic Trees - Terminology

Trees - Rooted and Unrooted

[Figure: the seven rooted trees that can be derived from an unrooted tree for five sequences. Each rooted tree 1-7 corresponds to placing the root on the corresponding numbered branch of the unrooted tree.]

How many more rooted trees are there than unrooted (in general, that is)?

Explains why programs tend to search on unrooted trees
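One way to answer the question: a fully resolved unrooted tree with n tips has 2n - 3 branches, and each branch is one possible place to put the root, so there are 2n - 3 times as many rooted trees as unrooted trees. The sketch below checks this for the five-sequence example in the figure; the function names are hypothetical.

```python
def branches_in_unrooted_tree(n):
    """Number of branches in a fully resolved unrooted tree with n tips."""
    return 2 * n - 3


def unrooted_trees(n):
    """Count unrooted binary trees by adding tips one at a time.

    Adding tip k to a tree with k - 1 tips can be done on any of its
    2(k - 1) - 3 branches, starting from the single tree for 3 tips.
    """
    count = 1
    for k in range(4, n + 1):
        count *= branches_in_unrooted_tree(k - 1)
    return count


def rooted_trees(n):
    """Each branch of each unrooted tree is one possible root position."""
    return unrooted_trees(n) * branches_in_unrooted_tree(n)


if __name__ == "__main__":
    n = 5
    print(branches_in_unrooted_tree(n), unrooted_trees(n), rooted_trees(n))
```

For five sequences this gives 7 root positions, 15 unrooted trees and 105 rooted trees, so searching unrooted trees means far fewer candidates to evaluate.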

Phylogenetic Trees - Terminology

Resolution - a tree with no polytomies is fully resolved

Hard polytomy = simultaneous divergence; rapid radiation; no amount of data will resolve

Soft polytomy = divergence not simultaneous, but data currently insufficient to resolve

Difficult to distinguish in practice!!

Phylogenetic Trees - Terminology

CLADE - a set of OTUs that includes all the OTUs descended from their most recent common ancestor (if the set is given a name, the named group would be monophyletic)

- not all clades are given names

- typically, we use ‘clade’ when talking about trees and ‘monophyletic’ when talking about taxonomic names (classifications) !!


Phylogenetic Trees - Terminology

TOPOLOGY - the branching patterns of the tree (all that a cladogram depicts)

BRANCH LENGTH - proportional to amount of evolutionary change (rate x time)

a branch can be long because
- the lineage has a higher rate of evolution, or
- the rate of evolution is the same but the lineage has had more time to evolve than other lineages,
or a combination of these…

Phylogenetic Trees - Terminology

NODE - ancestral OTU

note: Hennig’s cladistic method requires explicit distinction between anagenesis and cladogenesis (with the latter being the default)
- thus most fossils and potential ‘ancestors’ are not depicted as nodes but rather as OTUs
- cladograms are not evolutionary trees

Phylogenetic Trees - Terminology

[Figure: a cladogram compared with an evolutionary tree]

Cladistics

Hennig’s original method:

1. Distinguish homologies from analogies (Remane’s criteria)

2. Distinguish derived homologies (apomorphies) from ancestral homologies (plesiomorphies)

3. Tree then built from apomorphies (evidence of common ancestry)

Cladistics

Hennig’s original method: (not really used in modern analyses)

Distinguish derived homologies (apomorphies) from ancestral homologies (plesiomorphies)
- determines the direction of evolution
- determines the polarity of the characters
- “to polarize the characters”

Polarity

Why does polarity matter?

Direction of evolution: Polarity
• apomorphy: derived trait
• plesiomorphy: ancestral trait

e.g. which state is the apomorphy?

[Figure: tree of Porifera, Cnidaria and “Bilateria” with the character states cnidae present / cnidae absent]


Polarity

Why does polarity matter?

[Figure: two alternative rooted trees of Porifera, “Bilateria” and Cnidaria with the character states cnidae present / cnidae absent]

Polarity

How does one polarize characters?

- Hennig’s method was to do this a priori - before the analysis, using information from:

1. Fossil record - younger states are derived
2. Ontogeny - derived states usually appear later in developmental sequence
3. Outgroup comparison - given 2 states in your ingroup, the state also found in the outgroup is the ancestral state

[an old, bad method: common = primitive, e.g. placental vs marsupials]
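The outgroup comparison rule (item 3) is simple enough to write down as a function. The sketch below is only an illustration of the rule as stated, with a hypothetical function name; it uses the cnidae example with Porifera as the outgroup, and returns None when the rule cannot decide.

```python
def polarize_by_outgroup(ingroup_states, outgroup_state):
    """Apply the outgroup comparison rule to one character.

    ingroup_states maps ingroup taxon -> state; outgroup_state is the state
    seen in the outgroup. Returns (ancestral_state, derived_states), or None
    when the ingroup does not vary or the outgroup state is not found in it.
    """
    observed = set(ingroup_states.values())
    if len(observed) < 2 or outgroup_state not in observed:
        return None
    return outgroup_state, sorted(observed - {outgroup_state})


if __name__ == "__main__":
    cnidae = {"Cnidaria": "present", "Bilateria": "absent"}
    # Porifera (the outgroup) lacks cnidae.
    print(polarize_by_outgroup(cnidae, "absent"))
    # One state in the ingroup, a different one in the outgroup: undecidable.
    print(polarize_by_outgroup({"ingroup": "A"}, "T"))
```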

Polarity

Hennigian terminology

nouns
- apomorph - derived character state
- autapomorph - derived character state not shared
- synapomorph - shared derived character state
- plesiomorph - ancestral character state
- symplesiomorph - shared ancestral character state

adjectives
- apomorphic, autapomorphic, etc.

Polarity

The Hennigian / Cladistic method’s success (in contrast to phenetics) was attributed to the 2-step process of

1) careful homology assessment
2) a priori character polarization

If these issues are ignored you will have groups based on homoplasy (convergence) or plesiomorphy and they won’t be monophyletic

Polarity

Paraphyletic - Taxon based on plesiomorphy
- e.g. Reptiles
- left out descendants who have apomorphic state(s) (feathers or fur)

Error made of thinking two homologous states (e.g. feathers & scales) are not homologous

Polyphyletic - Taxon based on homoplasy
- e.g. a group that included all marine vertebrates

Error made of thinking two non-homologous states are homologous

[Figure: Not a vulture]

Polarity

Paraphyletic - A taxon with the most recent common ancestor and some but not all of the descendants
- a common mistake in systematics
- a “natural but incomplete” group

Polyphyletic - A taxon without the most recent common ancestor
- a rare mistake in systematics
- an “unnatural” group


Polarity

Monophyletic - A taxon with the most recent common ancestor and all of the descendants

- a taxon based on derived homologies (synapomorphies)

- a natural group

- whether the ancestor is in the taxon or not is another way of saying whether the taxon is based on (defined by) homologous states (recall definition of homology)

- If OTUs share homologous states they do so because they inherited them from their common ancestor

Polarity

During the 1980s it was determined that polarity did not need to be determined before the analysis (a priori polarization of the original Hennigian method was unnecessary)

You might think it is critical to prevent the false group on the right

[Figure: two rooted trees of Porifera, “Bilateria” and Cnidaria with the character states cnidae present / cnidae absent]

But these trees are identical when unrooted! They are the same tree!

Polarity

If we use Porifera as the outgroup to root the tree we get the correct character polarities

We do not need to know that ‘presence of cnidae’ is the apomorphic state before analysis

[Figure: two rooted trees of Porifera, “Bilateria” and Cnidaria with the character states cnidae present / cnidae absent]

But we do need to know what group is the outgroup! This is critical to proper understanding of character evolution

Polarity

A priori polarization is obviously a problem with DNA data

- How would one know before analysis which nucleotide was ancestral and which derived?

- Outgroups could be used to tediously attempt to polarize thousands of characters before analysis

- Or one could simply include the outgroup in the analysis and root the tree where the outgroup joins

This provides us the polarities for all the characters instantly

Polarity

Well, maybe not all the characters…

- if there is only one state in the ingroup and one state in the outgroup there is no way to tell which state is ancestral and which is derived

- you need a further out out-group to arbitrate

[Figure: tree with tips outgroup2, ingroup, outgroup and states A, A, T]

- Also, never assume that all the states in the ingroup are derived and all the states in the outgroup are ancestral
  - What if you switched groups?
  - Sample outgroup densely too

- Provides a means to solve the problem of the indel:
  - was it a deletion or an insertion?
  - like: were the wings gained? Or lost?

Parsimony

Because character conflict, homoplasy, is common we need a method to resolve this conflict

- We can brush aside the problem and use an algorithmic method, like neighbor-joining, which builds one tree from distance data [NOT recommended] - more on this later

- Or we can use an optimality criterion that allows us to rank alternate trees from best to worst


Parsimony

Parsimony is an optimality criterion

Prefer the tree or trees that minimize the amount of evolutionary change required to explain the data

Based on Ockham’s Razor “shave away all that is unnecessary” - plurality should not be posited without necessity; when there are multiple hypotheses that explain the data equally well, choose the simplest one

Choice of simplest hypothesis is a good rule of thumb (but remember, the data matter far more than the method!)

Parsimony

Parsimony will allow one to find the tree that minimizes homoplasy, aka the shortest tree

but if you have made careless homology decisions (e.g. poorly aligned your data) even the most parsimonious tree may be horribly wrong

Thus some cladists emphasize that we don’t use parsimony because it is the method most likely to find the true tree - we use it because it provides the “least falsified” hypothesis (truth is unknowable)

“best tree” not “true tree”

Parsimony

Glass half full justification of parsimony:
“Parsimonious hypotheses can be defended without resorting to special knowledge, authoritarianism, or a priorisms”

Glass half empty justification:
“If we have no biological theory to help choose among hypotheses we resort to choosing the simplest out of ignorance, but not because it’s more likely to be true”

Parsimony

Two problems to solve

1. Determine optimality criterion score (tree length) for each tree (easy / fast)

2. Search over all possible trees to find the tree(s) that is/are the best according to the optimality criterion (e.g. shortest; hard for more than 11 OTUs)

Parsimony

Various algorithms are used to count changes on trees

PAUP* can score over 100k trees / sec

CHARACTERS: 1 amnion, 2 hair, 3 lactation, 4 placenta, 5 antorbital fenestra, 6 wings

TAXA        1  2  3  4  5  6
Frog        -  -  -  -  -  -
Bird        +  -  -  -  +  +
Crocodile   +  -  -  -  +  -
Kangaroo    +  +  +  -  -  -
Bat         +  +  +  +  -  +
Human       +  +  +  +  -  -

FIT         1  2  3  4  5  6   TREE LENGTH
Tree 1      1  1  1  1  1  2   7
Tree 2      1  2  2  2  2  1   10

[Figures: Tree 1 and Tree 2 for Frog, Bird, Crocodile, Kangaroo, Bat and Human, with the numbered character changes mapped onto the branches]

Of these two trees, Tree 1 has the shorter length and is the more parsimonious

Both trees require some homoplasy (extra steps)
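The step counts in the fit table can be reproduced with Fitch's algorithm, one of the standard ways to count changes for unordered characters on a binary tree. The sketch below is not how PAUP* does it internally, and the two topologies are assumptions: the tree figures are not recoverable here, so TREE_1 and TREE_2 were chosen to give the per-character steps shown in the table. Character order follows the matrix columns.

```python
CHARACTERS = ["amnion", "hair", "lactation", "placenta",
              "antorbital fenestra", "wings"]

# Rows of the character matrix: "+" = present, "-" = absent.
MATRIX = {
    "Frog":      "------",
    "Bird":      "+---++",
    "Crocodile": "+---+-",
    "Kangaroo":  "+++---",
    "Bat":       "++++-+",
    "Human":     "++++--",
}

# Assumed topologies (nested pairs), both rooted on Frog.
TREE_1 = ("Frog", (("Bird", "Crocodile"), ("Kangaroo", ("Bat", "Human"))))
TREE_2 = ("Frog", ("Crocodile", (("Bird", "Bat"), ("Kangaroo", "Human"))))


def fitch(tree, states):
    """Return (possible ancestral states, steps) for one character."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_steps = fitch(tree[0], states)
    right_set, right_steps = fitch(tree[1], states)
    shared = left_set & right_set
    if shared:
        # The two descendants agree on at least one state: no change needed here.
        return shared, left_steps + right_steps
    # No shared state: one change is required at this node.
    return left_set | right_set, left_steps + right_steps + 1


def tree_length(tree):
    """Return (steps per character, total tree length)."""
    per_character = []
    for i in range(len(CHARACTERS)):
        states = {taxon: row[i] for taxon, row in MATRIX.items()}
        per_character.append(fitch(tree, states)[1])
    return per_character, sum(per_character)


if __name__ == "__main__":
    for name, tree in (("Tree 1", TREE_1), ("Tree 2", TREE_2)):
        steps, length = tree_length(tree)
        print(name, steps, length)
```

With six binary characters the minimum possible length is 6, so Tree 1 (length 7) needs one extra step (wings) and the assumed Tree 2 (length 10) needs four.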


Tree searching

Number of taxa   Number of unrooted trees
4                3
5                15
6                105
7                945
8                10,395
9                135,135
10               2,027,025
11               34,459,425
12               654,729,075
13               13,749,310,575
14               316,234,143,225
15               7,905,853,580,625
16               213,458,046,676,875
17               6,190,283,353,629,375
18               191,898,783,962,510,625
19               6,332,659,870,762,850,625
20               221,643,095,476,699,771,875
21               8,200,794,532,637,891,559,375

For n taxa:

$$\#\,\text{trees} = \frac{(2n-5)!}{2^{\,n-3}\,(n-3)!}$$

Would take PAUP over 2 billion years to evaluate all trees for 21 OTUs

Tree searching

Finding the shortest trees belongs in a class of famously difficult problems called NP-complete problems

Imagine we could do much better than PAUP (10^3 trees/s)
- say we had a massively parallel computer with 10^12 processors and each could calculate lengths for 10^12 trees/sec. With this we could examine all trees for 20 OTUs in about 2 x 10^-4 seconds.

How long for this setup to do 50 OTUs?
9 x 10^42 years (over 10^32 times the age of the universe)
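The table and the timing estimates can be checked with a few lines of Python using the formula (2n-5)! / (2^(n-3) (n-3)!). This is a back-of-the-envelope sketch: the function names are made up, a year is taken as 365.25 days, and the rates are the ones quoted in the lecture (over 100k trees/sec for PAUP*, and 10^12 processors each scoring 10^12 trees/sec for the imaginary machine).

```python
from math import factorial

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def num_unrooted_trees(n):
    """Number of fully resolved unrooted trees for n taxa (n >= 3)."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


def exhaustive_search_years(n, trees_per_second):
    """Years needed to score every unrooted tree for n taxa at a given rate."""
    return num_unrooted_trees(n) / trees_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    for n in range(4, 22):
        print(n, format(num_unrooted_trees(n), ","))

    paup_rate = 100_000               # trees/sec, from the lecture
    dream_rate = 10 ** 12 * 10 ** 12  # processors x trees/sec each

    print(exhaustive_search_years(21, paup_rate))       # billions of years
    print(num_unrooted_trees(20) / dream_rate)          # seconds for 20 OTUs
    print(exhaustive_search_years(50, dream_rate))      # years for 50 OTUs
```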

Tree searching

If 11 or fewer OTUs, can do an exhaustive search
- this guarantees the shortest tree(s) will be found

If 12-25 OTUs, can do a branch and bound search
- this also guarantees the shortest tree(s) will be found but not all trees are examined

For more than 25 OTUs, must use other methods: heuristic searching
- approximate methods
- do not guarantee the shortest tree will be found
- can get trapped in local optima while searching for global optima (shortest trees)
(to be continued…)

Example

Cho et al. 1988. A cladistic analysis of the genus Necrophorus (Coleoptera, Silphidae). Nature and Life 18(1): 9-13

Tree obtained by hand-calculations

length = 33 steps

Parsimony

See future lectures for follow up topics on…
- analysis of large datasets & heuristic searching
- how parsimony performs relative to other methods
- with what type of datasets it is safe to use parsimony

Example

My PAUP analysis of their data:

Their “best” tree = length 33 steps

PAUP found
- 43,021 trees of length 33
- 2,971 trees of length 31
- 512 trees of length 30
- 29 trees of length 29


Terms - from lecture & readings

- Purines, pyrimidines
- Transitions, transversions
- Weighting
- Data correction
- Inference
- Deduction
- Induction
- Newick tree format
- Cladogram
- Phylogram
- Dendrogram
- Chronogram
- Ultrametric tree
- Tips, leaves, terminals, OTUs
- Branches
- Nodes
- Root
- Polytomy
- Multifurcation
- Unrooted vs rooted tree
- Outgroup
- Ingroup
- Tree resolution
- Hard polytomy
- Soft polytomy
- Rapid radiation
- Clade
- Topology
- Branch length
- Polarity
- Apomorph, plesiomorph
- Autapomorph, synapomorph etc.
- Paraphyletic
- Polyphyletic
- Monophyletic
- Parsimony
- Optimality criterion
- Ockham’s razor
- Exhaustive search
- Branch and bound search
- Heuristic search

Study questions

What are some methods to deal with saturated data?

What is the rule of thumb for phylogenetic sample size and why?

Distinguish a hard from a soft polytomy. How might one do so in practice?

What are two reasons a branch might be long?

Why does polarity matter?

Contrast Hennigʼs original method to produce a dataset with the modern method.
