
Introduction to Phylogenetics

Prof. Dr. Nizamettin AYDIN
http://www3.yildiz.edu.tr/~naydin

What is Molecular Phylogenetics…

• Phylogenetics
  – the study of evolutionary relationships
  – one part of the larger field of systematics, which also includes taxonomy.
• The term taxonomy connotes the process and methodology for the naming and classification of organisms.
• Systematics
  – the branch of biology that deals with classification and nomenclature; taxonomy

…What is Molecular Phylogenetics…

• The context of evolutionary biology is phylogeny,
  – the connections between all groups of organisms as understood by ancestor/descendant relationships.
• The molecular mechanisms of the organisms studied strongly suggest that all organisms on earth have a common ancestor.
  – Thus, the species are related to each other by virtue of having evolved from the same common ancestor.
• Such a relationship of species is called phylogeny and its graphical representation is called a phylogenetic tree.

…What is Molecular Phylogenetics

• Example:
  – relationship among species
[Figure: example trees with the leaves crocodiles, birds, snakes, lizards, rodents, marsupials and primates]

Etymology

• phyl- (or phylo-)
  – Latin, from Greek, from phylē, phylon.
    • tribes, races or phyla
• phyla
  – a direct line of descent within a group
• genetics (genetic + -ics)
  – From Greek genetikos from genesis (origin)
    • laws of origination (1872)
    • study of heredity (from 1891)
• phylogeny
  – the evolutionary history of a kind of organism
  – the evolution of a genetically related group of organisms
• phylogenetics
  – a branch of science that deals with phylogeny

A Brief History of Molecular Phylogenetics

• 1900s
  – Immunochemical studies
    • cross-reactions stronger for closely related organisms
  – Nuttall (1902) - apes are closest relatives to humans!
• 1960s - 1970s
  – Methods such as electrophoresis, DNA hybridization and the Polymerase Chain Reaction (PCR) contributed to a boom in molecular phylogeny
• late 1970s to present
  – Discoveries using molecular phylogeny
    • Endosymbiosis - Margulis, 1978
    • Divergence of phyla and kingdom - Woese, 1987
    • Many Tree of Life projects completed or underway

Copyright 2000 N. AYDIN. All rights reserved.

Molecular data vs. Morphology/Physiology

• Molecular data
  – Strictly heritable entities
  – Data is unambiguous
  – Regular & predictable evolution
  – Quantitative analyses
  – Ease of homology assessment
  – Relationship of distantly related organisms can be inferred
  – Abundant and easily generated with PCR and sequencing
• Morphology/Physiology
  – Can be influenced by environmental factors
  – Ambiguous modifiers: “reduced”, “slightly elongated”, “somewhat flattened”
  – Unpredictable evolution
  – Qualitative argumentation
  – Homology difficult to assess
  – Only close relationships can be confidently inferred
  – Problems when working with micro-organisms and where visible morphology is lacking

Phylogenetic concepts: Interpreting a Phylogeny

[Figure: tree with the leaves Sequence A, Sequence B, Sequence C, Sequence D, Sequence E (top to bottom) along a time axis ending at Present]

• Physical position in tree is not meaningful
• Swiveling can only be done at the nodes
• Only tree structure matters

Phylogenetic concepts: Interpreting a Phylogeny

[Figure: the same tree with branches swiveled, leaves now in the order Sequence A, Sequence B, Sequence E, Sequence D, Sequence C, along a time axis ending at Present]

• Physical position in tree is not meaningful
• Swiveling can only be done at the nodes
• Only tree structure matters

Tree Terminology

• Relationships are illustrated by a phylogenetic tree / dendrogram
  – Combination of Greek dendro/tree and gramma/drawing
  – A dendrogram is a tree diagram
    • frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering.
  – Dendrograms are often used in computational biology
    • to illustrate the clustering of genes or samples, sometimes on top of heatmaps.

Tree Terminology

• A cladogram is a type of phylogenetic tree that only shows tree topology
  – the shape indicating relatedness.
    • It shows that, say, humans are more closely related to chimpanzees than to gorillas,
  – but not the time or genetic distance between the species.
  – Combination of Greek clados/branch and gramma/drawing
• topology
  – the branching structure of the tree

Tree Terminology

• The branching pattern is called the tree’s topology
• Trees can be represented in several forms:
[Figures: Rectangular cladogram; Slanted cladogram]

Tree Terminology

• Circular cladogram
[Figure: circular cladogram]

Same tree - seven different views

Rectangular Phylogram, Rectangular Cladogram, Slanted Cladogram, Circular Phylogram, Circular Cladogram, Radial Phylogram and Radial Cladogram

Tree Terminology

[Figure: the same six taxa (A to F) drawn as a rooted tree, with the root marked, and as an unrooted tree]

• Rooted trees:
  – has a root that denotes common ancestry
• Unrooted trees:
  – Only specifies the degree of kinship among taxa but not the evolutionary path

Tree Terminology

[Figure: rooted tree of taxa A to F with labelled parts: Operational taxonomic units (OTU) / Taxa, Internal nodes, Terminal nodes, Root, Sisters, Branches, Polytomy]

Taxon, plural taxa (taxonomy): Any group or rank in a biological classification into which related organisms are classified.

Number of trees

• The number of rooted trees for n species:
$$N_R = \frac{(2n-3)!}{2^{\,n-2}\,(n-2)!}$$
• The number of unrooted trees for n species:
$$N_U = \frac{(2n-5)!}{2^{\,n-3}\,(n-3)!}$$
• Number of possible rooted and unrooted trees that can describe the possible relationships among fairly small numbers of data sets.

Tree Terminology

• Scaled trees:
  – Branch lengths are proportional to the number of nucleotide/amino acid changes that occurred on that branch (usually a scale is included).
• In the best of cases, scaled trees are also additive, meaning that
  – the physical length of the branches connecting any two nodes is an accurate representation of their accumulated differences.
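The two counting formulas on the "Number of trees" slide can be evaluated directly. The following is a minimal sketch; the function names `rooted_tree_count` and `unrooted_tree_count` are illustrative choices, not part of the slides, and the formulas are assumed to apply to strictly bifurcating trees.

```python
from math import factorial


def rooted_tree_count(n: int) -> int:
    """Number of rooted bifurcating trees for n species: (2n-3)! / (2^(n-2) (n-2)!)."""
    if n < 2:
        raise ValueError("need at least 2 species")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_tree_count(n: int) -> int:
    """Number of unrooted bifurcating trees for n species: (2n-5)! / (2^(n-3) (n-3)!)."""
    if n < 3:
        raise ValueError("need at least 3 species")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


if __name__ == "__main__":
    # Tabulate both counts for small data sets.
    for n in range(3, 11):
        print(n, rooted_tree_count(n), unrooted_tree_count(n))
```

Even for 10 species the rooted count already exceeds 34 million, which is why exhaustive tree searches become impractical quickly.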

Tree Terminology

• Unscaled trees:
  – Branch lengths are not proportional to the number of nucleotide/amino acid changes
    • usually used to illustrate evolutionary relationships only.
  – line up all terminal nodes and convey only their relative kinship without making any representation regarding the number of changes that separate them
[Figure: unscaled tree of taxa A to F with all terminal nodes aligned]

Tree Terminology

[Figure: Monophyletic groups (Saturnite 1, Saturnite 2, Saturnite 3, Martian 1, Martian 3, Martian 2) and Paraphyletic groups (Jupiterian 32, Jupiterian 5, Jupiterian 67, Human 11, Jupiterian 8, Human 3)]

• Monophyletic groups:
  – All taxa within the group are derived from a single common ancestor and members form a natural clade.
• Paraphyletic groups:
  – The common ancestor is also shared by other taxa that are not included in the group, so members do not form a natural clade.

Gene vs. Species Trees

• Gene tree
  – a phylogenetic tree based on the divergence observed within a single homologous gene.
    • Such trees may represent the evolutionary history of a gene but not necessarily that of the species in which it is found.
• Species trees
  – best obtained from analyses that use data from multiple genes.
    • For example, a study on the evolution of plant species used more than 100 different genes to generate a species tree for plants.

Character and Distance Data

• The molecular data used to generate phylogenetic trees fall into one of two categories:
  – Characters
    • a well-defined feature that can exist in a limited number of different states
  – Distances
    • a measure of the overall, pairwise difference between two data sets
• Both DNA and protein sequences are examples of data that describe a set of discrete character states.

Methods in Phylogenetic Reconstruction

• Distance Based Methods
  – calculate pairwise distances between sequences, and group sequences that are most similar.
  – This approach has potential for computational simplicity and therefore speed
• Character Based Methods (Maximum parsimony)
  – assumes that shared characters in different entities result from common descent.
  – Groups are built on the basis of such shared characters, and the simplest explanation for the evolution of characters is taken to be the correct, or most parsimonious one.
• Probabilistic Methods (Maximum likelihood)
  – compute the probability that a data set fits a tree derived from that data set, given a specified model of sequence evolution.

Comparison of Methods

• Distance
  – Uses only pairwise distances
  – Minimizes distance between nearest neighbors
  – Very fast
  – Easily trapped in local optima
  – Good for generating tentative tree, or choosing among multiple trees
• Maximum parsimony
  – Uses only shared derived characters
  – Minimizes total distance
  – Slow
  – Assumptions fail when evolution is rapid
  – Best option when tractable (<30 taxa, homoplasy rare)
• Maximum likelihood
  – Uses all data
  – Maximizes tree likelihood given specific parameter values
  – Very slow
  – Highly dependent on assumed evolution model
  – Good for very small data sets and for testing trees built using other methods

Methods in Phylogenetic Reconstruction

• Distance Based Methods
  – Pairwise distances/dissimilarities are calculated
  – Creates a distance/dissimilarity matrix
  – A phylogenetic tree is calculated with clustering algorithms, using the distance matrix.
  – Examples of clustering algorithms include
    • Unweighted Pair Group Method using Arithmetic averages (UPGMA)
    • Neighbor-joining clustering.

UPGMA

• Unweighted-Pair-Group Method with Arithmetic mean (UPGMA)
  – Oldest and simplest distance matrix method
  – Originally proposed in the early 1960s to help with the evolutionary analysis of morphological characters,
  – requires data that can be condensed to a measure of genetic distance between all pairs of taxa being considered.
  – requires a distance matrix such as one that might be created for a group of 4 taxa called A, B, C, and D.

UPGMA

• Assume that the pairwise distances between each of the taxa are given in the following matrix:

Species   A     B     C
B         dAB   -     -
C         dAC   dBC   -
D         dAD   dBD   dCD

  – dAB : distance between species A and B
  – dAC : distance between species A and C
  – …

UPGMA

• UPGMA begins by clustering the two species with the smallest distance separating them into a single, composite group.
  – Assume that the smallest value in the distance matrix corresponds to dAB, in which case species A and B are the first to be grouped (AB).
• After the first clustering, a new distance matrix is computed with the distance between the new group (AB) and species C and D being calculated as
$$d_{(AB)C} = \frac{d_{AC} + d_{BC}}{2} \quad\text{and}\quad d_{(AB)D} = \frac{d_{AD} + d_{BD}}{2}$$

UPGMA

• The species separated by the smallest distance in the new matrix are then clustered to make another new composite species.
• The process is repeated until all species have been grouped.
  – If scaled branch lengths are to be used on the tree to represent the evolutionary distance between species, branch points are positioned at a distance halfway between each of the species being grouped
    • i.e., at dAB/2 for the first clustering

UPGMA - example

• Consider the following alignment between five different DNA sequences
[Figure: alignment of five DNA sequences, A to E]
• Pairwise distance matrix

Species   A    B    C    D
B         9    -    -    -
C         8    11   -    -
D         12   15   10   -
E         15   18   13   5

  – Smallest distance:
    • dDE, so
  – Species D and species E are grouped

UPGMA - example

Starting matrix:

Species   A    B    C    D
B         9    -    -    -
C         8    11   -    -
D         12   15   10   -
E         15   18   13   5

After grouping D and E:

Species   A                    B                    C
B         9                    -                    -
C         8                    11                   -
DE        (12 + 15)/2 = 13.5   (15 + 18)/2 = 16.5   (10 + 13)/2 = 11.5

After grouping A and C:

Species   B                    AC
AC        (9 + 11)/2 = 10      -
DE        (15 + 18)/2 = 16.5   (13.5 + 11.5)/2 = 12.5

After grouping (AC) and B:

Species   (AC)B
DE        (16.5 + 12.5)/2 = 14.5

Estimation of Branch Lengths

• Tree describes the relatedness of sequences
• It is possible for the topology of phylogenetic trees to convey information about
  – the relative degree to which sequences have diverged.
  – In scaled trees that convey that information,
    • the length of branches corresponds to the inferred amount of time that the sequences have been accumulating substitutions independently.
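The clustering steps of the UPGMA example can be sketched in a few lines of code. This is an illustration under stated assumptions, not the lecturer's implementation: the function name `average_linkage` and the `size_weighted` flag are hypothetical. By default the sketch uses the simple average of the two merged rows, which is the arithmetic shown on the slides; with `size_weighted=True` each merged group is weighted by its number of species, which is how UPGMA is usually defined in the clustering literature (the simple average is often called WPGMA). The two variants agree on the first three merges of this example and differ only in the last distance.

```python
from itertools import combinations


def average_linkage(names, dist, size_weighted=False):
    """Repeatedly merge the closest pair of groups.

    names: list of species names.
    dist:  dict mapping frozenset({x, y}) to the distance between x and y.
    Returns a list of (merged_group_label, distance_at_merge) tuples.
    """
    d = dict(dist)
    size = {name: 1 for name in names}
    active = list(names)
    merges = []
    while len(active) > 1:
        # Pick the pair of active groups separated by the smallest distance.
        x, y = min(combinations(active, 2), key=lambda pair: d[frozenset(pair)])
        merged = f"({x},{y})"
        merges.append((merged, d[frozenset((x, y))]))
        wx, wy = (size[x], size[y]) if size_weighted else (1, 1)
        # Distance from the new group to every remaining group.
        for z in active:
            if z not in (x, y):
                d[frozenset((merged, z))] = (
                    wx * d[frozenset((x, z))] + wy * d[frozenset((y, z))]
                ) / (wx + wy)
        size[merged] = size[x] + size[y]
        active = [z for z in active if z not in (x, y)] + [merged]
    return merges


# Distance matrix from the UPGMA example slide.
EXAMPLE = {
    frozenset(pair): value
    for pair, value in {
        ("A", "B"): 9, ("A", "C"): 8, ("B", "C"): 11,
        ("A", "D"): 12, ("B", "D"): 15, ("C", "D"): 10,
        ("A", "E"): 15, ("B", "E"): 18, ("C", "E"): 13, ("D", "E"): 5,
    }.items()
}

if __name__ == "__main__":
    for label, distance in average_linkage(list("ABCDE"), EXAMPLE):
        print(label, distance)
```

Halving each merge distance gives the height at which the corresponding branch point is placed on a scaled tree.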

Estimation of Branch Lengths

• The relative length of each branch in a cladogram can be calculated using the information in a distance matrix.
  – In the example, dDE is 5,
    • the pair of branches connecting each of those species to their common ancestor should each be dDE/2 or 2.5 units long on a tree with scaled branch lengths.
  – A and C should be connected to their common ancestor by branches that are dAC/2 or 4 units long,
  – The branch point between (AC) and (DE) should be connected to (AC) and (DE) by branches that are both d(AC)(DE)/2 or 6.25 units long,

Estimation of Branch Lengths

• A scaled tree showing the branch lengths separating four of the species depicted in slide 30.
• Branch lengths are shown next to each branch.
• Branches are also drawn to scale to reflect the amount of differences between all species.
• This very simple approach to estimating branch lengths actually allows UPGMA to intrinsically generate rooted phylogenetic trees.

Estimation of Branch Lengths

• Determining branch lengths for a scaled tree is only slightly more complicated
  – when it cannot be assumed that evolutionary rates are the same for all lineages.
• The simplest tree whose branch lengths might have some meaningful information is one with just three species (A, B, C) and one branch point, such as the one shown.
[Figure: unrooted tree with one branch point and three branches of length a, b and c leading to A, B and C]
• On such a tree, the length of each of the three branches can be represented by a single letter (a, b, and c) for which we know the following must be true:
dAB = a + b ; dBC = b + c ; dAC = a + c

Estimation of Branch Lengths

• Phylogeny reconstruction for 3 sequences
  – There is a single tree topology
  – The branch lengths (a, b, c):
a + b = dAB
b + c = dBC
a + c = dAC

     A    B     C
A    -    a+b   a+c
B    -    -     b+c
C    -    -     -

• Input:
  – dAB, dBC and dAC (pairwise distances)
• Output:
a = (dAB + dAC – dBC) / 2
b = (dAB + dBC – dAC) / 2
c = (dAC + dBC – dAB) / 2

Estimation of Branch Lengths - example

• Distance matrix of 3 sequences and unrooted tree

     A    B    C
A    --   22   39
B    --   --   41
C    --   --   --

  – distance from A to B = a + b = 22 (1)
  – distance from A to C = a + c = 39 (2)
  – distance from B to C = b + c = 41 (3)
• subtracting (2) from (3) yields:
b + c – (a + c) = b – a = 41 – 39 = 2 (4)

Estimation of Branch Lengths - example

• adding (1) and (4) yields
a + b + b – a = 2b = 22 + 2 = 24
2b = 24
b = 24 / 2 = 12
• so
a + b = a + 12 = 22;
a = 22 – 12 = 10
• finally
a + c = 10 + c = 39;
c = 39 – 10 = 29
[Figure: unrooted tree with branch lengths 10 to A, 12 to B and 29 to C]
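The closed-form expressions for a, b and c given with the three-sequence tree can be checked against this worked example. A minimal sketch follows; the function name `three_point_branch_lengths` is an illustrative assumption.

```python
def three_point_branch_lengths(d_ab, d_ac, d_bc):
    """Branch lengths (a, b, c) of the single unrooted tree joining A, B and C.

    Solves a + b = d_ab, a + c = d_ac, b + c = d_bc.
    """
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c


if __name__ == "__main__":
    # Distances from the example: dAB = 22, dAC = 39, dBC = 41.
    print(three_point_branch_lengths(22, 39, 41))
```

With the example distances this reproduces the values derived step by step on the slides (a = 10, b = 12, c = 29).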

Neighbor's Relation Method

• Popular variant of the UPGMA method
• emphasizes pairing species in such a way that
  – a tree is created with the smallest possible branch lengths overall.
• On any unrooted tree, pairs of species that are separated from each other by just one internal node are said to be neighbors.

Neighbor's Relation Method

• The topology of a phylogenetic tree such as the one shown below gives rise to some useful algebraic relationships between neighbors.

Species   A     B     C
B         dAB   -     -
C         dAC   dBC   -
D         dAD   dBD   dCD

Species   A       B       C
B         a+b     -       -
C         a+e+c   b+e+c   -
D         a+e+d   b+e+d   c+d

[Figure: unrooted tree with neighbors A and B (terminal branches a and b) joined by a central branch e to neighbors C and D (terminal branches c and d)]

• If the tree above is a true tree for which additivity holds, then the following should be true:
dAC + dBD = dAD + dBC = a + b + c + d + 2e = dAB + dCD + 2e
  – where a, b, c, and d are the lengths of the terminal branches and e is the length of the single central branch.

Neighbor's Relation Method

• The following conditions, known together as the four-point condition, will also be true:
dAB + dCD < dAC + dBD ; dAB + dCD < dAD + dBC
• It is in this way that a neighborliness approach considers all possible pairwise arrangements of four species and determines which arrangement satisfies the four-point condition.
  – An important assumption of the four-point condition is that branch lengths on a phylogenetic tree should be additive and, while it is not especially sensitive to departures from that assumption, data sets that are not additive can cause this method to generate a tree with an incorrect topology.

Neighbor's Relation Method - example

• Consider the alignment:
A ACGCGTTGGGCGATGGCAAC
B ACGCGTTGGGCGACGGTAAT
C ACGCATTGAATGATGATAAT
D ACACATTGAGTGATAATAAT
• The distances between these sequences can be shown as a table:

     A   B   C   D
A    -   3   7   8
B    -   -   6   7
C    -   -   -   3
D    -   -   -   -
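The distance table for this alignment can be reproduced by counting mismatched positions between each pair of sequences. The sketch below assumes that "distance" here means the plain number of differing sites (no correction for multiple substitutions); the names `mismatch_count` and `pairwise_distances` are illustrative.

```python
from itertools import combinations

# Alignment from the Neighbor's Relation Method example.
ALIGNMENT = {
    "A": "ACGCGTTGGGCGATGGCAAC",
    "B": "ACGCGTTGGGCGACGGTAAT",
    "C": "ACGCATTGAATGATGATAAT",
    "D": "ACACATTGAGTGATAATAAT",
}


def mismatch_count(s, t):
    """Number of aligned positions at which two sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(s, t))


def pairwise_distances(alignment):
    """Map each pair of sequence names to its mismatch count."""
    return {
        (p, q): mismatch_count(alignment[p], alignment[q])
        for p, q in combinations(alignment, 2)
    }


if __name__ == "__main__":
    for pair, distance in pairwise_distances(ALIGNMENT).items():
        print(pair, distance)
```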

Estimation of Branch Lengths - example

• Using this information, an unrooted tree showing the relationship between these sequences can be drawn:

     A   B   C   D
A    -   3   7   8
B    -   -   6   7
C    -   -   -   3
D    -   -   -   -

[Figure: unrooted tree with neighbors A and B on one side and C and D on the other; terminal branch lengths a = 2, b = 1, c = 1, d = 2 and central branch e = 4]

Neighbor-Joining Methods

• A variant of neighborliness
• an agglomerative technique, and so operates using iteration,
  – building the tree from the bottom-up
• Start with a star-like tree with all species coming off a single central node regardless of their number.
• Neighbors are then sequentially found that minimize the total length of the branches on the tree.
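For four species, the neighbor pairing and the branch lengths of the resulting tree follow from the four-point condition and the additive relations given earlier. The sketch below is one way to carry out that calculation, assuming the distances are additive; the function name `quartet_tree` and its return layout are hypothetical.

```python
def quartet_tree(taxa, dist):
    """Resolve four taxa into two neighbor pairs and estimate branch lengths.

    taxa: sequence of four names.
    dist: dict mapping frozenset({x, y}) to the distance between x and y.
    Returns (pairing, terminal_lengths, central_length).
    """
    w, x, y, z = taxa

    def d(p, q):
        return dist[frozenset((p, q))]

    # The three possible ways to split four taxa into two pairs of neighbors.
    pairings = [((w, x), (y, z)), ((w, y), (x, z)), ((w, z), (x, y))]
    # Four-point condition: the true neighbors give the smallest sum.
    (p, q), (r, s) = min(pairings, key=lambda pr: d(*pr[0]) + d(*pr[1]))

    # Under additivity d(p,r) + d(q,s) = d(p,q) + d(r,s) + 2e.
    central = (d(p, r) + d(q, s) - d(p, q) - d(r, s)) / 2
    a = (d(p, q) + d(p, r) - d(q, r)) / 2  # terminal branch leading to p
    c = (d(r, s) + d(p, r) - d(p, s)) / 2  # terminal branch leading to r
    terminal = {p: a, q: d(p, q) - a, r: c, s: d(r, s) - c}
    return ((p, q), (r, s)), terminal, central


# Distance table from the example alignment.
EXAMPLE_DIST = {
    frozenset(pair): value
    for pair, value in {
        ("A", "B"): 3, ("A", "C"): 7, ("A", "D"): 8,
        ("B", "C"): 6, ("B", "D"): 7, ("C", "D"): 3,
    }.items()
}

if __name__ == "__main__":
    print(quartet_tree(["A", "B", "C", "D"], EXAMPLE_DIST))
```

For the example table, dAB + dCD = 6 is smaller than the other two sums (both 14), so A and B are neighbors, as are C and D.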

Neighbor-Joining Methods

• The input is an n×n dissimilarity/distance matrix d.
• In the first iteration,
  – the n leaves are all in their own clusters;
• In subsequent iterations,
  – each cluster is a set of leaves,
    • but the clusters are disjoint.
• At the beginning of each iteration, the taxa are partitioned into clusters, and for each cluster we have a rooted tree that is leaf-labelled by the elements in the cluster.

Neighbor-Joining Methods

• During the iteration, a pair of clusters is selected to be made siblings;
  – this results in the trees for the two clusters being merged into a larger rooted tree by making their roots siblings.
• When there are only three subtrees, then the three subtrees are merged into a tree on all the taxa by adding a new node, r, and making the roots of the three subtrees adjacent to r.
• Note that this description suggests that neighbor joining produces a rooted tree.
  – However, after the tree is produced, the root is ignored, so that neighbor joining actually returns an unrooted tree.

Neighbor-Joining Methods

• The neighbor-joining method: a new method for reconstructing phylogenetic trees. (1987). Molecular Biology and Evolution. doi:10.1093/oxfordjournals.molbev.a040454

Character-Based Methods of Phylogenetics

• The concept of parsimony is at the very heart of all character-based methods of phylogenetic reconstruction.
  – Parsimonious: describes someone who is especially careful with the spending of their money.
• In a biological sense, parsimony is used to describe
  – the process of attaching preference to one evolutionary pathway over another on the basis of which pathway requires the invocation of the smallest number of mutational events.

Character-Based Methods of Phylogenetics

• Phylogenetic trees represent theoretical models that depict the evolutionary history of 3 or more sequences.
• The two premises that underlie biological parsimony are quite simple:
  – Mutations are exceedingly rare events
  – The more unlikely events a model invokes, the less likely the model is to be correct.
• As a result, the relationship that requires the fewest mutations to explain the current state of the sequences being considered is the relationship that is most likely to be correct.

Character-Based Methods of Phylogenetics

• Informative and Uninformative Sites
[Figure: short four-way multiple sequence alignment with six positions]
• The short four-way multiple sequence alignment shown above contains positions that fall into two categories in terms of their information content for a parsimony analysis:
  – those that have information (are informative)
  – those that do not (are uninformative).
• The relationship between four sequences can be described by only three different unrooted trees

Character-Based Methods of Phylogenetics

• Position 1: All 4 sequences have the same character (a "G") and the position is therefore said to be invariant.
  – Invariant sites are clearly uninformative because each of the three possible trees that describe the relationship of the four sequences invokes exactly the same number of mutations (0)

Character-Based Methods of Phylogenetics

• Position 2: Uninformative from a parsimony perspective because one mutation occurs in all three of the possible trees.

Character-Based Methods of Phylogenetics

• Position 3: Uninformative because all three trees require two mutations

Character-Based Methods of Phylogenetics

• Position 4: Uninformative because all three trees require three mutations.

Character-Based Methods of Phylogenetics

• Position 5: Informative because one of the three trees invokes only one mutation and the other two alternative trees both require two mutations

Character-Based Methods of Phylogenetics

• Position 6: Informative because one of the three trees invokes only one mutation and the other two alternative trees both require two mutations

Character-Based Methods of Phylogenetics

• In general, for a position to be informative regardless of how many sequences are aligned,
  – it has to have at least two different nucleotides
  – each of these nucleotides has to be present at least twice.
• All parsimony programs begin by applying this fairly simple rule to the data set being analyzed.
• Notice that 4 of the 6 positions being considered in the alignment shown in slide 50 are simply discarded and not considered any further in a parsimony analysis.
  – All of those sites would have contributed to the pairwise similarity scores used by a distance-based approach, and this difference alone can generate substantial differences in the conclusions reached by both types of approaches.

Character-Based Methods of Phylogenetics

• Once uninformative sites have been identified and discarded, implementation of the parsimony approach in its simplest form can be straightforward.
• Every possible tree is considered individually for each informative site.
• A running tally is maintained for each tree that keeps track of the minimum number of substitutions required for each position.
• After all informative sites have been considered, the tree (or trees) that needs to invoke the smallest total number of substitutions is labeled the most parsimonious.
• https://paup.phylosolutions.com/
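The rule for informative positions (at least two different nucleotides, each present at least twice) translates directly into code. The following is a minimal sketch; the function names are illustrative, and the test sequences are the four sequences S1 to S4 used in the maximum parsimony example later in these slides, not the six-position alignment of slide 50, which did not survive in this handout. Gap characters are not treated specially here, which is a simplifying assumption.

```python
from collections import Counter


def is_informative(column):
    """True if at least two different characters each occur at least twice."""
    counts = Counter(column)
    return sum(1 for count in counts.values() if count >= 2) >= 2


def informative_positions(sequences):
    """Zero-based indices of the parsimony-informative alignment columns."""
    return [
        index
        for index, column in enumerate(zip(*sequences))
        if is_informative(column)
    ]


if __name__ == "__main__":
    sequences = ["CACCCCTT", "AACCCCAT", "CACTGCTT", "AACTGCTA"]
    print(informative_positions(sequences))
```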

Character-Based Methods of Phylogenetics

• Maximum Parsimony
  – All possible trees are determined for each position of the sequence alignment
  – Each tree is given a score based on the number of evolutionary steps needed to produce said tree
  – The most parsimonious tree is the one that has the fewest evolutionary changes for all sequences to be derived from a common ancestor
  – Usually several equally parsimonious trees result from a single run.

Maximum parsimony: exhaustive stepwise addition

[Figure: Step 1 shows the single tree for taxa A, B and C; Step 2 shows the three trees obtained by adding D to each branch; Step 3 shows the trees obtained by adding E to each branch of the Step 2 trees]

Maximum Parsimony - example

• Maximum parsimony methods predict the evolutionary tree that minimizes the number of steps required to generate the observed variation in the sequences.
  – First, a multiple sequence alignment must be obtained.
• For each aligned position, phylogenetic trees that require the smallest number of evolutionary changes to produce the observed sequence changes are identified.

Maximum Parsimony - example

• This continues for each position in the alignment.
• Those trees that produce the smallest number of changes overall for all sequence positions are identified.
  – This is a rather time-consuming algorithm that only works well if the sequences have a strong sequence similarity.

Maximum Parsimony - example

• Assuming we have 4 sequences
  – There are 3 possible trees:
• The optimal tree is obtained by adding the number of changes at each informative site for each tree, and picking the tree requiring the least total number of changes.
• For a large number of sequences the number of trees to examine becomes so large that it might not be possible to examine all possible trees.

Maximum Parsimony - example

[Figure: two alternative trees for the same three-nucleotide sequences, with ancestral states and the number of substitutions marked on each branch. Left tree: Total #substitutions = 3. Right tree: Total #substitutions = 4.]

• The left tree is preferred over the right tree.

Maximum Parsimony - example

• Consider the following sequences

S1   C A C C C C T T
S2   A A C C C C A T
S3   C A C T G C T T
S4   A A C T G C T A

[Figure: the first position mapped onto two trees. (S1,S2),(S3,S4): mutation = 2. (S1,S3),(S2,S4): mutation = 1.]

Maximum Parsimony - example

Number of substitutions required at each position, and in total, for each of the three trees:

S1                  C A C C C C T T
S2                  A A C C C C A T
S3                  C A C T G C T T
S4                  A A C T G C T A
(S1,S2),(S3,S4)     2 0 0 1 1 0 1 1    6 √
(S1,S3),(S2,S4)     1 0 0 2 2 0 1 1    7
(S1,S4),(S2,S3)     2 0 0 2 2 0 1 1    8
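The per-position substitution counts for the three possible trees of S1 to S4 can be recomputed with a small-parsimony pass. The sketch below uses Fitch's set-intersection rule, which is a standard way to obtain the minimum number of changes on a fixed tree; the slides do not name the counting procedure, so treating it as Fitch's method is an assumption, and the function names are illustrative.

```python
SEQS = {
    "S1": "CACCCCTT",
    "S2": "AACCCCAT",
    "S3": "CACTGCTT",
    "S4": "AACTGCTA",
}


def site_cost(a, b, c, d):
    """Minimum substitutions at one site on the unrooted tree ((a,b),(c,d))."""
    cost = 0
    left = {a} & {b}
    if not left:            # the two left leaves disagree: one change needed
        left = {a, b}
        cost += 1
    right = {c} & {d}
    if not right:           # the two right leaves disagree: one change needed
        right = {c, d}
        cost += 1
    if not (left & right):  # the two sides share no state: change on the central branch
        cost += 1
    return cost


def quartet_scores(seqs):
    """Per-site costs for each of the three unrooted trees of four sequences."""
    w, x, y, z = list(seqs)
    topologies = [((w, x), (y, z)), ((w, y), (x, z)), ((w, z), (x, y))]
    return {
        ((p, q), (r, s)): [
            site_cost(*column)
            for column in zip(seqs[p], seqs[q], seqs[r], seqs[s])
        ]
        for (p, q), (r, s) in topologies
    }


if __name__ == "__main__":
    for tree, costs in quartet_scores(SEQS).items():
        print(tree, costs, sum(costs))
```

The tree with the smallest total is the most parsimonious one; uninformative positions add the same amount to every tree, so they do not change the ranking.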

Methods in Phylogenetic Reconstruction

• Maximum Likelihood
  – Creates all possible trees like the Maximum Parsimony method
  – But instead of retaining trees with the fewest evolutionary steps,
    • employs a model of evolution in which different transition/transversion ratios can be used
  – For each tree generated, the probability that it reflects each position of the sequence data is calculated
  – Calculation is repeated for all nucleotide sites
  – Finally, the tree with the best probability is shown as the maximum likelihood tree
    • usually only a single tree remains
  – It is a more realistic tree estimation because it does not assume an equal transition-transversion ratio for all branches.

Tree Confidence

• All phylogenetic trees represent hypotheses regarding the evolutionary history of the sequences that make up a data set.
• Like any good hypothesis, it is reasonable to ask two questions about how well it describes the underlying data:
  – How much confidence can be attached to the overall tree and its component parts (branches)?
  – How much more likely is one tree to be correct than a particular or randomly chosen alternative tree?
• A powerful technique called bootstrapping has become the predominant favorite for addressing the first question, and a simple parametric comparison of two trees is typical of those used to address the second.

Bootstrapping

• Bootstrap analysis is a kind of statistical analysis to test the reliability of certain branches in the evolutionary tree.
  – In statistics, it is any test or metric that relies on random sampling with replacement.
  – It falls in the broader class of resampling methods.
• It involves resampling one's own data, with replacement, to create a series of bootstrap samples of the same size as the original data.
• In the case of nucleotide (amino acid) sequences, the resampled data are the nucleotides (amino acids) of a sequence while the statistical significance of a specific cluster is given by the fraction of trees, based on the resampled data, containing that cluster.

Bootstrapping

• Bootstrap tests allow for a rough quantification of those confidence levels.
• The basic approach of a bootstrap test is straightforward:
• A subset of the original data is drawn (with replacement) from the original data set and a tree is inferred from this new data set,
  – as illustrated in the next slide

Bootstrapping

• Illustration of a bootstrap test for a data set that is 10 characters long (positions: 1 through 10) and five sequences (I through V) deep.
• The original data set and its most parsimonious tree (a).
• A random number generator chooses 10 times from the original 10 positions and creates the three resampled data sets with their corresponding trees (b).
• A consensus tree (c) for the three resampled data sets is shown with circled values indicating the fraction of bootstrapped trees that support the clustering of sequences I and II and sequences IV and V.

Bootstrapping

• Computational method to estimate the confidence level of a certain phylogenetic tree.

Sample
            0123456789
rat         GAGGCTTATC
human       GTGGCTTATC
turtle      GTGCCCTATG
fruitfly    CTCGCCTTTG
oak         ATCGCTCTTG
duckweed    ATCCCTCCGG

Pseudo sample 1
            001122234556667
rat         GGAAGGGGCTTTTTA
human       GGTTGGGGCTTTTTA
turtle      GGTTGGGCCCCTTTA
fruitfly    CCTTCCCGCCCTTTT
oak         AATTCCCGCTTCCCT
duckweed    AATTCCCCCTTCCCC

Pseudo sample 2
            445556777888899
rat         CCTTTTAAATTTTCC
human       CCTTTTAAATTTTCC
turtle      CCCCCTAAATTTTGG
fruitfly    CCCCCTTTTTTTTGG
oak         CCTTTCTTTTTTTGG
duckweed    CCTTTCCCCGGGGGG

[Figure: inferred tree with the leaves rat, human, turtle, fruit fly, oak and duckweed]

Many more replicates (between 100 - 1000)
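The resampling step of a bootstrap test can be sketched as follows. This is an illustration under stated assumptions: the function names `bootstrap_alignment` and `cluster_support` are hypothetical, tree inference itself is left out (any method could be applied to each pseudo sample), and each replicate tree is assumed to be summarized as a set of clusters (frozensets of leaf names). Column indices are sorted only so that pseudo samples read like the ones shown on the slide.

```python
import random

# Sample alignment from the bootstrapping slide.
SAMPLE = {
    "rat": "GAGGCTTATC",
    "human": "GTGGCTTATC",
    "turtle": "GTGCCCTATG",
    "fruitfly": "CTCGCCTTTG",
    "oak": "ATCGCTCTTG",
    "duckweed": "ATCCCTCCGG",
}


def bootstrap_alignment(alignment, rng):
    """Draw alignment columns with replacement, keeping the original length.

    Returns (chosen_column_indices, resampled_alignment).
    """
    length = len(next(iter(alignment.values())))
    columns = sorted(rng.randrange(length) for _ in range(length))
    resampled = {
        name: "".join(sequence[i] for i in columns)
        for name, sequence in alignment.items()
    }
    return columns, resampled


def cluster_support(replicate_clusters, cluster):
    """Fraction of replicate trees that contain the given cluster."""
    cluster = frozenset(cluster)
    hits = sum(cluster in clusters for clusters in replicate_clusters)
    return hits / len(replicate_clusters)


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so the run is repeatable
    columns, pseudo_sample = bootstrap_alignment(SAMPLE, rng)
    print(columns)
    for name, sequence in pseudo_sample.items():
        print(f"{name:10s}{sequence}")
```

Every sequence is resampled with the same column indices, so each pseudo sample is still a valid alignment.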

Bootstrap values

[Figure: tree with the leaves rat, human, turtle, fruit fly, oak and duckweed; bootstrap values 100, 65, 0 and 55 shown at the internal nodes]

• Values are in percentages
• Conventional practice:
  – only values 60-100% are shown

Parametric Tests

• The underlying principle of parsimony suggests that the tree that invokes the smallest number of substitutions is the one that is most likely to depict the true relationship between the sequences.
• A common question is "How much more likely is the most parsimonious tree than a particular alternative that has previously been put forward to describe the relationship between these taxa?"

Parametric Tests

• One of the first parametric tests devised to answer such questions for parsimony analyses is that of H. Kishino and M. Hasegawa (1989).
  – Their test assumes that informative sites within an alignment are both independent and equivalent and uses the difference in the minimum number of substitutions invoked by two trees, D, as a test statistic.
• A value for the variance, V, of D is determined by considering each of the informative sites separately as follows:
$$V = \frac{n}{n-1} \sum_{i=1}^{n} \left( D_i - \frac{1}{n} \sum_{k=1}^{n} D_k \right)^{2}$$
where n is the number of informative sites.

Parametric Tests

• A paired t-test with n-1 degrees of freedom can then be used to test the null hypothesis that the two trees are not different from each other:
t = [expression in D, n and V; its layout was lost in extraction]
• Alternative parametric tests are available not just for parsimony analyses but for distance matrix and maximum likelihood trees as well.
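The variance V of the difference D can be computed from per-site substitution counts. The sketch below follows the formula for V as reconstructed above (n/(n-1) times the sum of squared deviations of the per-site differences from their mean); the function name is hypothetical, and the t statistic is deliberately not computed because its exact form is not recoverable from this handout. The example input uses the counts at the three informative positions (1, 4 and 5) of the S1 to S4 parsimony example for the trees (S1,S2),(S3,S4) and (S1,S3),(S2,S4).

```python
def kh_difference_and_variance(costs_tree1, costs_tree2):
    """Return (D, V) for two trees given per-informative-site substitution counts.

    D is the summed per-site difference D_i = cost1_i - cost2_i.
    V = n / (n - 1) * sum((D_i - mean(D_i)) ** 2), with n informative sites.
    """
    if len(costs_tree1) != len(costs_tree2):
        raise ValueError("both trees must be scored on the same sites")
    diffs = [x - y for x, y in zip(costs_tree1, costs_tree2)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two informative sites")
    total = sum(diffs)
    mean = total / n
    variance = n / (n - 1) * sum((d - mean) ** 2 for d in diffs)
    return total, variance


if __name__ == "__main__":
    # Informative positions 1, 4 and 5 of the S1-S4 example.
    print(kh_difference_and_variance([2, 1, 1], [1, 2, 2]))
```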

Some Discoveries Made Using Molecular Phylogenetics

• Endosymbiosis: Origin of the Mitochondrion and Chloroplast
[Figure: rooted tree with the groups α-purple bacteria, other bacteria, chloroplasts, mitochondria, cyanobacteria and eukaryotes]
• Mitochondria and chloroplasts are derived from the α-purple bacteria and the cyanobacteria respectively, via separate endosymbiotic events.

Some Discoveries Made Using Molecular Phylogenetics

• Universal Tree of Life
  – Using rRNA sequences
  – Able to study the relationships of uncultivated organisms, obtained from a hot spring in Yellowstone National Park

Some Discoveries Made Using Molecular Phylogenetics

• Relationships within species: HIV subtypes
[Figure: tree of HIV subtypes A, B, C, D, F and G, with isolates labelled by country of origin (Rwanda, Ivory Coast, Italy, U.S., Uganda, India, U.K., Ethiopia, S. Africa, Tanzania, Netherlands, Russia, Romania, Taiwan, Cameroon, Brazil)]

Problems and Errors in Phylogenetic Reconstruction

• Inherent strengths and weaknesses in different tree-making methodologies.
• More is better
  – Errors in inferred phylogeny may be caused by small data sets and/or limited sampling.
• Unsuitable sequences
  – those undergoing rapid nucleotide changes, or slow to zero changes over time, may skew phylogenetic estimations

Problems and Errors in Phylogenetic Reconstruction

• Mutations:
  – Duplications, inversions, insertions, deletions etc. can give inaccurate signals
• Genomic hotspots:
  – small regions of rapid evolution are not easily detected
• Homoplasy:
  – nucleotide changes that are similar but occurred independently in separate lineages are mistakenly assumed to be inherited changes
• Sample contamination / mislabeling:
  – always a possibility when working with large data sets