
Reconstructing the Tree of Life

Carol E. Lee
University of Wisconsin
Copyright ©2020; do not upload without permission

• In the lecture, I talked about a “Phylogenetic Species Concept”

– What is a “Phylogeny?”
– How do you construct one?
– Why on earth should I care?

Why you should care:

• All biological relationships can be determined by constructing phylogenies: even if phylogenies are not always the best way to define species boundaries, they do tell you the genetic and evolutionary relationships among groups and individuals
– Your ancestry
– Diseases: figure out evolutionary origins and evolutionary pathways of disease, like HIV, Ebola, SARS, etc.
– Crops and livestock (food security): rescue from , create new varieties
– Endangered species: figure out how endangered populations are related and how to perform genetic rescue

Tree of Life Web Project

http://www.tolweb.org/tree/

Tree of Life 2016
Hug et al. 2016, Nature Microbiology

[Figure: tree of life showing the three domains Bacteria, Eukarya, and Archaea]

Outline

1. What is a phylogeny?

2. How do you construct a phylogeny?
The Molecular Clock
Statistical Methods

Think about relationships among the major lineages of life and when they appeared in the fossil record

Are genetic distances and the fossil record roughly congruent?

Fossil Record vs Molecular Clock

• Molecular clock and fossil record are not always congruent
– Fossil record is incomplete, and soft-bodied species are usually not preserved
– Mutation rates can vary among species (depending on generation time, replication error, mismatch repair)

• But they provide complementary information
– Fossil record contains extinct species, while molecular data are based on extant taxa
– Major events in the fossil record can be used to calibrate the molecular clock

Evolutionary History of HIV

HIV evolved multiple times from SIV (Simian Immunodeficiency Virus)

[Figure: phylogeny of HIV and SIV; axis: Time. Source: Freeman & Herron, Evolutionary Analysis, 2004]

Charles Darwin (1809-1882)

On the Origin of Species (1859)

– Living species are related by common ancestry

– Change through time occurs at the population not the organism level

– The main cause of adaptive evolution is natural selection

Darwin envisaged evolution as a tree

The affinities of all the beings of the same class have sometimes been represented by a great tree. I believe this simile largely speaks the truth… The green and budding twigs may represent existing species; and those produced during former years may represent the long succession of extinct species… the great Tree of Life… covers the earth with ever-branching and beautiful ramifications

Charles Darwin, On the Origin of Species; pages 131-132

Reconstructing the Tree of Life

[Figure: the only figure in The Origin of Species]

What did people believe before Darwin?

Lamarck proposed a ladder of life

[Figure: ladder of life running from Past to Future]

Jean-Baptiste Lamarck

• French Naturalist (1744-1829)
• “Professor of Worms and Insects” in Paris

• The first scientific theory of evolution (inheritance of acquired traits)

Lamarck’s View of Evolution

[Figure: the Great Chain of Being; labels: Being, God, Angels, Realm of Being, Demons, Man, Realm of Becoming, Non-Being]

• Continuum between physical and biological world (followed )
• Scala Naturae (“Ladder of Life” or “Great Chain of Being”)

What is wrong with a ladder?

• Evolution is not linear but branching

• Living organisms are not ancestors of one another

• The ladder implies progress

What is right with the tree?

• Evolution is a branching process
• If a mutation occurs, one species is not turning into another, but there is a split, and both lineages continue to evolve

• So, evolution is not progressive - all living taxa are equally “successful”

• Phylogenies (trees) reflect the hierarchical structuring of relationships

The Tree of Life is a Fractal

Genealogical structures

• Phylogeny
– A depiction of the ancestry relations between species (it includes speciation events)
– Tree-like (divergent)

• Pedigree
– A depiction of the ancestry relations within populations
– Net-like (reticulating)

[Figure: four butterflies connected to their parents]

[Figure: pedigree of individuals in a population, linking parents to offspring]

[Figure: from past to future, a population phylogeny of species lineages with speciation events]

What happened here? Branching
What happened here? Extinction

Representation of phylogenies?

[Figure: two trees of taxa A, B, and C, labeled “The True History” and “A simplified representation”]

Some terms used to describe a phylogeny

[Figure: tree labeled with these terms: taxon (taxa), tip, internal branch, internode, node (speciation event), root]

Outline

1. What is a phylogeny?

2. How do you construct a phylogeny?
The Molecular Clock
Statistical Methods

What is a Phylogeny?

• A phylogenetic tree represents a hypothesis about evolutionary relationships

• Each branch point represents the divergence of two taxa (e.g. species)

• Sister taxa are groups that share an immediate common ancestor

[Figure: tree of taxa A–F showing a branch point (node), sister taxa, the ancestral lineage, the common ancestor of taxa A–F, and a polytomy (unresolved branching point)]

Molecular Clock

• Phylogenies rely on the “Molecular Clock,” namely the fact that, on average, mutations occur at a given rate

• So, on average, more mutational differences between taxa means that they branched from a common ancestor longer ago

• So longer branches on a phylogeny often → greater evolutionary distance

Example: mitochondria accumulate mutations at ~2.2% per million years
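The clock arithmetic can be sketched in a few lines. This is a minimal illustration, not a calibrated method: it assumes the ~2.2% per million years figure is the rate at which two lineages diverge from each other, that the rate is constant, and that no site has mutated more than once. The function name `divergence_time_my` and the 4.4% input are hypothetical.

```python
def divergence_time_my(pairwise_divergence, rate_per_my=0.022):
    """Estimate time since two lineages split, in millions of years.

    pairwise_divergence: fraction of aligned sites that differ between
        two sequences (0.044 means 4.4%).
    rate_per_my: fraction of sites expected to diverge per million years
        (assumed constant; 0.022 is the slide's ~2.2% figure).
    """
    if rate_per_my <= 0:
        raise ValueError("rate_per_my must be positive")
    return pairwise_divergence / rate_per_my


# Two hypothetical mtDNA sequences that differ at 4.4% of sites
print(divergence_time_my(0.044))  # about 2 million years
```

If the rate differs between lineages, the same divergence maps to different times, which is why rate variation is a problem for the clock.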

Phylogeny of 53 humans (Homo sapiens) based only on mtDNA

A different locus might yield a different tree

The horizontal branch lengths reflect genetic distance ≈ # of mutations

Cladogram of mitochondrial cytochrome oxidase II alleles in humans and the African Great Apes (Ruvolo et al. 1994)

This is not a phylogeny, but a cladogram.

A cladogram shows the hierarchical relationships among the taxa, but the branch lengths do not reflect evolutionary time.

Molecular Clock

Problem: mutation rate can vary among species

• Mutation rate is faster with:
– Shorter generation time (greater number of meiosis or mitosis events in a given time)
– Replication error (e.g. sloppy DNA or RNA polymerase; poor mismatch repair mechanisms)

[Figure: hierarchical classification within the order Carnivora. Felidae: Panthera (Panthera pardus). Mustelidae: Taxidea (Taxidea taxus), Lutra (Lutra lutra). Canis (Canis latrans, Canis lupus)]

A monophyletic group consists of an ancestral taxon and all its descendants
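The definition above translates directly into a check on a tree. The sketch below is only an illustration: the tree is a hypothetical one written as nested tuples, and the helper names `leaves` and `is_monophyletic` are made up for this example.

```python
def leaves(node):
    """Return the set of tip names below a node (a tip is a string)."""
    if isinstance(node, str):
        return {node}
    tips = set()
    for child in node:
        tips |= leaves(child)
    return tips


def is_monophyletic(tree, group):
    """True if some node's descendants are exactly the taxa in `group`."""
    group = set(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


# Hypothetical rooted tree of seven taxa
tree = ((("A", "B"), "C"), (("D", "E"), ("F", "G")))

print(is_monophyletic(tree, {"A", "B", "C"}))  # True: an ancestor and all its descendants
print(is_monophyletic(tree, {"A", "B", "D"}))  # False: no single node has exactly these tips
```

The check only answers whether a group is a clade. It does not distinguish a paraphyletic group from a polyphyletic one.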

[Figure: three copies of a tree of taxa A–G. (a) Monophyletic group (clade): Group I. (b) Paraphyletic group: Group II. (c) Polyphyletic group: Group III]

(In the lecture on species concepts we discussed that the “smallest” monophyletic group is a “phylogenetic species”)

Examples of Paraphyletic Groups (not recognized as legitimate groups in the Phylogenetic , which only recognizes monophyletic groups)

Synapomorphies

• Synapomorphies are shared derived homologous traits

• They can be DNA nucleotides or other heritable traits

• They are used to group taxa that are more closely related to one another

[Figure: tree with synapomorphies marked]

Sometimes similar-looking traits are not homologous, and are not synapomorphies, but are the result of convergent evolution

How do we construct Phylogenies?

Phylogenetic Methods

• Parsimony: Minimize # steps

• Distance Matrix: minimize pairwise genetic distances

• Maximum Likelihood: Probability of the data given the tree

• Bayesian: Probability of the tree given the data

Parsimony

Uses Discrete Characters (like mutations, or some heritable trait)

Select the tree with the minimum number of character-state transitions summed across all characters

Parsimony: Example 1 (Fig. 26-15)

Three phylogenetic hypotheses for Species I, II, and III:
Tree 1: I and II are sister taxa
Tree 2: I and III are sister taxa
Tree 3: II and III are sister taxa

Site                1  2  3  4
Species I           C  T  A  T
Species II          C  T  T  C
Species III         A  G  A  C
Ancestral sequence  A  G  T  T

Each change from the ancestral sequence is one event, labeled site/new base (1/C, 2/T, 3/A, 4/C). A change shared by sister taxa counts once, on their common branch; otherwise it counts once on each branch that carries it.

Tree 1: 6 events
Tree 2: 7 events
Tree 3: 7 events

Parsimony: Example 2

Three possible trees

[Figure: three possible trees for outgroup O and taxa A, B, C]
Tree 1: (O,(C,(B,A)))
Tree 2: (O,(A,(B,C)))
Tree 3: (O,(B,(A,C)))

Map the characters (mutations) onto tree 1

Character  1  2  3  4  5
O          T  G  G  A  A
A          G  C  G  A  A
B          G  C  A  A  A
C          G  C  A  C  T

Characters 1, 2, 4, and 5 each change once; character 3 changes twice.

Total number of steps = 6

Actually, there is more than one way to map character 3

[Figure: two alternative mappings of character 3 on tree 1]

Either way the character contributes 2 steps to the overall tree length

Map the characters onto tree 2

Each of the five characters changes once.

Number of steps = 5

Tree 3

Characters 1, 2, 4, and 5 each change once; character 3 changes twice.

Length = 6 steps

Which tree had the shortest branch lengths (most parsimonious)?

Most parsimonious tree: Tree 2

Tree 1: length = 6
Tree 2: length = 5
Tree 3: length = 6
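The step counting in Example 2 can be reproduced with Fitch's small-parsimony algorithm, which is one standard way to count the minimum number of changes on a fixed tree (the slides count by hand and do not name an algorithm). In this sketch the trees are written as nested pairs, which is an assumed reading of the tip order in the figure, and the function names are made up for illustration.

```python
def fitch_steps(tree, states):
    """Return (possible ancestral states, steps) for one character.

    tree: a tip name (str) or a pair (left, right) of subtrees.
    states: dict mapping tip name -> state of this character.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_steps = fitch_steps(tree[0], states)
    right_set, right_steps = fitch_steps(tree[1], states)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one state: no change needed here
        return shared, left_steps + right_steps
    # Children disagree: one change is required on this part of the tree
    return left_set | right_set, left_steps + right_steps + 1


def tree_length(tree, alignment):
    """Sum the minimum number of steps over all characters."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for site in range(n_sites):
        states = {taxon: seq[site] for taxon, seq in alignment.items()}
        total += fitch_steps(tree, states)[1]
    return total


# Character matrix from Example 2
alignment = {"O": "TGGAA", "A": "GCGAA", "B": "GCAAA", "C": "GCACT"}

# Assumed topologies, read from the tip order of the three trees
trees = {
    "Tree 1": ("O", ("C", ("B", "A"))),
    "Tree 2": ("O", ("A", ("B", "C"))),
    "Tree 3": ("O", ("B", ("A", "C"))),
}

lengths = {name: tree_length(t, alignment) for name, t in trees.items()}
print(lengths)  # {'Tree 1': 6, 'Tree 2': 5, 'Tree 3': 6}
print(min(lengths, key=lengths.get))  # Tree 2
```

With only four taxa there are three possible unrooted trees, so all of them can be scored. For larger data sets the number of trees grows too fast to score them all, and real programs search heuristically.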

Where do the Whales belong?

Example from Freeman & Herron, Fig. 4.8

Freeman & Herron, Fig. 4.9: Using maximum parsimony, it looks like the whales cluster with the hippos (and cows)

Parsimony

• Simplest and fastest method of phylogenetic reconstruction

• Can give misleading results if rates of evolution (rates that mutations occur) differ in different lineages

• Tends to become less accurate as genetic distances get greater
• Can be misled by reversals and homoplasy: with only 4 nucleotides, the same mutations eventually occur repeatedly at a given site (called “saturation”)
– “multiple hits (mutations) per site”

Distance Matrix

Uses continuous or discrete characters

Distance Matrix

• Calculate pairwise distances between taxa
• Choose the tree that minimizes overall distances between taxa
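The first step above, calculating pairwise distances, might look like the sketch below. It uses the simple proportion of differing sites (often called the p-distance), which is one common choice; the slides do not say which distance measure is intended. The sequences and taxon names are hypothetical.

```python
from itertools import combinations


def p_distance(seq1, seq2):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be non-empty and aligned to the same length")
    differences = sum(a != b for a, b in zip(seq1, seq2))
    return differences / len(seq1)


def distance_matrix(alignment):
    """Return {(taxon_a, taxon_b): distance} for every pair of taxa."""
    return {
        (a, b): p_distance(alignment[a], alignment[b])
        for a, b in combinations(alignment, 2)
    }


# Hypothetical aligned sequences, 10 sites each
alignment = {
    "taxon1": "ACGTACGTAC",
    "taxon2": "ACGTACGTAA",
    "taxon3": "ACGAACGGAA",
    "taxon4": "TCGAACGGCA",
}

for pair, dist in distance_matrix(alignment).items():
    print(pair, round(dist, 2))
```

A tree-building step (for example a clustering method) would then take this matrix as input. The raw proportion underestimates the true distance when sites have been hit more than once, so corrected distances based on a model of evolution are often used instead.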

proportion sequence distance at 2 (hypothetical data)

         Mouse  Cat   Dog   Dolphin  Seal
Mouse
Cat      0.05
Dog      0.03   0.02
Dolphin  0.08   0.15  0.03
Seal     0.09   0.23  0.01  0.02

Freeman & Herron, Fig. 4.10: Using genetic distances, it looks like the whales again cluster with the hippos (and cows)

Distance Matrix

• Generally more accurate than parsimony

• Like parsimony, it tends to be computationally fast

Maximum Likelihood (R.A. Fisher)

• Probability of the data given the tree
• This is a “Frequentist” method: one true answer (one true tree)

• Draw from the data (probability distribution of DNA sequence data) to find the true tree

• Choose the tree (x, y axis) that maximizes the probability of the observed data (z axis)

[Figure: likelihood surface. x, y: tree space; z: probability of the data]

Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution. 17(6):368-76.

Maximum Likelihood (R.A. Fisher)

• Probability of the data given the tree
• The aim of maximum likelihood estimation is to find the parameter value(s) that make the observed data most likely.

• For example: finding a mean. If you want to have a number that describes the data, like height, you could find the mean
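The mean example can be made concrete. The sketch below assumes the heights come from a normal distribution with a known spread, tries a grid of candidate means, and keeps the one under which the observed data are most likely. The data, the grid, and the standard deviation of 10 are all hypothetical.

```python
import math


def normal_log_likelihood(data, mean, sd=1.0):
    """Log of the probability density of the data given a normal(mean, sd) model."""
    return sum(
        -0.5 * math.log(2 * math.pi * sd**2) - (x - mean) ** 2 / (2 * sd**2)
        for x in data
    )


heights = [160.0, 165.0, 170.0, 175.0, 180.0]  # hypothetical heights in cm

# Candidate parameter values: 150.0, 150.5, ..., 190.0
candidates = [150 + 0.5 * i for i in range(81)]

# Keep the candidate mean that makes the observed data most likely
best = max(candidates, key=lambda m: normal_log_likelihood(heights, m, sd=10.0))
sample_mean = sum(heights) / len(heights)

print(best, sample_mean)  # 170.0 170.0
```

Under this model the maximum likelihood estimate of the mean is the ordinary sample mean. In phylogenetics the role of the candidate mean is played by the candidate tree (with its branch lengths), and the search space is far larger.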

$P(\text{data} \mid \text{tree}) = L(\text{tree} \mid \text{data})$

Tree = hypothesis

[Figure: likelihood surface. x, y: tree space; z: probability of the data]

Felsenstein, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution. 17(6):368-76.

Maximum Likelihood (R.A. Fisher)

• Often yields more accurate tree than parsimony or distance

• Relies on an accurate assumption of which mutations are more probable (e.g. does A->G occur more often than A->T or A->C? That is, it needs an accurate model of molecular evolution)
• Computationally intensive

Bayesian Inference
Reverend Thomas Bayes (1702-1760)

• Probability of a tree given the data
• Uses prior information on the tree
• Does not assume that there is one correct tree
• Will modify estimate based on additional information

• Uses Bayes’ Theorem

$P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{P(B)}$

Bayesian Inference
Reverend Thomas Bayes (1702-1760)

• Uses Bayes’ Theorem

$P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{P(B)}$

$P(\text{tree} \mid \text{data}) = \dfrac{P(\text{data} \mid \text{tree})\,P(\text{tree})}{P(\text{data})}$
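Bayes' theorem above can be applied to a small set of candidate trees. The sketch below is a toy illustration under strong assumptions: only three trees are considered, the likelihood values are invented, and the priors are equal. Real Bayesian phylogenetics samples a vast tree space rather than enumerating it.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' theorem over a finite set of hypotheses (here, trees)."""
    # Numerator for each tree: P(data | tree) * P(tree)
    joint = {tree: priors[tree] * likelihoods[tree] for tree in priors}
    # Denominator: P(data), the sum of the numerators over all trees
    p_data = sum(joint.values())
    if p_data == 0:
        raise ValueError("the data have zero probability under every tree")
    return {tree: value / p_data for tree, value in joint.items()}


# Equal prior probability for each of three hypothetical trees
priors = {"Tree 1": 1 / 3, "Tree 2": 1 / 3, "Tree 3": 1 / 3}

# Hypothetical likelihoods P(data | tree) from a first data set
likelihoods = {"Tree 1": 0.002, "Tree 2": 0.006, "Tree 3": 0.002}

post = posterior(priors, likelihoods)
print({tree: round(p, 3) for tree, p in post.items()})
# {'Tree 1': 0.2, 'Tree 2': 0.6, 'Tree 3': 0.2}

# When more data arrive, the old posterior can serve as the new prior
more_likelihoods = {"Tree 1": 0.001, "Tree 2": 0.004, "Tree 3": 0.003}
post2 = posterior(post, more_likelihoods)
print({tree: round(p, 3) for tree, p in post2.items()})
```

The result is a probability for every tree rather than a single best tree, which is how the method avoids assuming that there is one correct answer.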

$P(A)$ = prior probability, the probability of a tree
$P(A \mid B)$ = posterior probability, the probability of the tree given the data
$P(B \mid A)$ = the probability of observing B (data) given A (tree), also known as the likelihood. It indicates the compatibility of the evidence with the given hypothesis.
$P(B)$ = probability of the data

Bayesian Inference
Reverend Thomas Bayes (1702-1760)

• Probability of a tree given the data:

• Will modify estimate based on additional information: so as you get more data, you update your hypothesis for the tree

• Uses prior information on the tree: this is where you start

• The sequential use of the Bayes' formula (recursive): when more data become available after calculating a posterior distribution, the posterior becomes the next prior

• Does not assume that there is one correct tree

Bayesian Inference

• Like Likelihood, often yields more accurate tree than parsimony or distance

• Computationally more intensive than parsimony or distance matrix, but less intensive than likelihood

• Needs a prior probability for the tree and a model of evolution

Potential problems of Phylogenetic Reconstruction

• Sufficient Amount of Data:
– With enough data most statistical methods usually yield the same tree (but not always; sometimes there is no single resolved tree)
– Insufficient data would yield a tree that lacks resolution (lacks statistical power)

• Gene trees vs species trees
– Evolutionary histories of individual genes are not necessarily the same
– Should try to get data from many genes, or the whole genome

Challenges of Phylogenetic Reconstructions

• Different parts of the genome might have different evolutionary histories (different gene , horizontal gene transfers, allopolyploidy, etc)

• So, there might not be one true tree for a group of taxa, and relationships might be difficult to resolve because they are inherently complex
• Current trend is to use whole genome data to reconstruct phylogenies

• Gain a comprehensive picture of the evolutionary relationships among taxa for the whole genome

Phylogenetic Reconstructions

• Typically, evolutionary biologists will use a variety of methods to reconstruct a phylogeny.
• Maximum likelihood and Bayesian methods are considered more robust.
• A tree is only as good as the data. Having many homoplastic characters (due to convergent evolution, reversals, etc.) will make the reconstruction less robust
• It is standard to use bootstrapping to assess the validity of the tree
• Understanding statistics is fundamental to understanding evolution
• Much of statistics was in fact developed in order to model evolutionary processes (such as ANOVA, analysis of variance)

1. Sometimes the Molecular Clock (based on genetic data) conflicts with the Geological Record. Why would this happen?

(A) Sometimes there are gaps in the geological record, because fossils do not form everywhere, and mutation rate might vary between different species

(B) Radiometric dating relies on chance events in the preservation of isotopes, making the timing of events in the geological time scale less accurate than the molecular clock

(C) Mutation rates slow down as you go back in time, making estimation of timing of events less accurate as you go back in time

(D) The molecular clock is calculated from radioisotopes, while the geological record is obtained from fossil data. The two can conflict when fossils end up displaced from their original sedimentary layer

2. You are a medical researcher working on HIV. A strain has appeared in Madison, Wisconsin. To determine which drugs would be most effective in treating this new strain (because different strains are resistant to different drugs), you need to determine its recent evolutionary history. You decide to reconstruct the evolutionary history of HIV by using a phylogenetic approach. Thus, you collect samples from patients in various geographic locations and sequence a fragment of RNA. Using parsimony, which is the correct phylogeny for HIV-1 based on the data below?

HIV-1, Uganda, Africa               ACAUG
HIV-1, San Francisco, USA           UGAUG
HIV-1, Madison, USA                 UAAGG
HIV-1, New York, USA                UAAAG
HIV-1, Paris                        ACAUC
HIV-2, Africa (ancestral outgroup)  ACCUG

3. Which of the following is most TRUE regarding phylogenetic reconstructions?

(A) Phylogenetic reconstruction based on any gene would yield the same tree
(B) Parsimony is the most accurate method for reconstructing phylogenies
(C) Phylogenetic reconstructions based on different genes could yield different phylogenetic trees
(D) Maximum likelihood relies on maximizing distances among taxa
(E) There is always one true tree, and having enough genetic data will inevitably result in one tree

Answers

• 1A
• 2C
• 3C