Math/C SC 5610 Computational Biology Lecture 12: Phylogenetics

Stephen Billups
University of Colorado at Denver

Announcements
- Project Guidelines and Ideas are posted (proposal due March 8).
- CCB Seminar, Friday (Mar. 4)
  Speaker: Jack Horner, SAIC
  Title: Phylogenetic Methods for Characterizing the Signature of Stage I Ovarian Cancer in Serum Protein Mas
  Time: 11-12 (followed by lunch)
  Place: Media Center, AU008

Outline
- Distance based methods for phylogenetics
  - UPGMA
  - WPGMA
  - Neighbor-Joining
- Character based methods
  - Maximum Likelihood
  - Maximum Parsimony

Review: Distance Based Clustering Methods
Main idea:
- Requires a distance matrix $D$ defining the distance between each pair of elements.
- Repeatedly group together the closest elements.
- The algorithms differ in how they treat distances between groups:
  - UPGMA (unweighted pair group method with arithmetic mean).
  - WPGMA (weighted pair group method with arithmetic mean).

UPGMA
1. Initialize $C$ to the $n$ singleton clusters $\{1\}, \ldots, \{n\}$.
2. Initialize $\mathrm{dist}(c, d)$ on $C$ by defining $\mathrm{dist}(\{i\}, \{j\}) = D(i, j)$.
3. Repeat $n - 1$ times:
   (a) Determine the pair $c, d$ of clusters in $C$ such that $\mathrm{dist}(c, d)$ is minimal; define $d_{\min} = \mathrm{dist}(c, d)$.
   (b) Define the new cluster $e = c \cup d$; update $C = (C \setminus \{c, d\}) \cup \{e\}$.
   (c) Define a node with label $e$ and daughters $c, d$, where $e$ has distance $d_{\min}/2$ to its leaves.
   (d) Define, for all $f \in C$ with $f \neq e$,
       \[ \mathrm{dist}(e, f) = \mathrm{dist}(f, e) = \frac{\mathrm{dist}(c, f) + \mathrm{dist}(d, f)}{2} \quad \text{(average of the previous distances).} \]

Example
[Figure]

WPGMA
1. Initialize $C$ to the $n$ singleton clusters $\{1\}, \ldots, \{n\}$.
2. Initialize $\mathrm{dist}(c, d)$ on $C$ by defining $\mathrm{dist}(\{i\}, \{j\}) = D(i, j)$.
3. Repeat $n - 1$ times:
   (a) Determine the pair $c, d$ of clusters in $C$ such that $\mathrm{dist}(c, d)$ is minimal; define $d_{\min} = \mathrm{dist}(c, d)$.
   (b) Define the new cluster $e = c \cup d$; update $C = (C \setminus \{c, d\}) \cup \{e\}$.
   (c) Define a node with label $e$ and daughters $c, d$, where $e$ has distance $d_{\min}/2$ to its leaves.
   (d) Define, for all $f \in C$ with $f \neq e$,
       \[ \mathrm{dist}(e, f) = \mathrm{dist}(f, e) = \frac{|c|\,\mathrm{dist}(c, f) + |d|\,\mathrm{dist}(d, f)}{|c| + |d|} \quad \text{(weighted average).} \]

Note: much of the literature assigns the two names the other way round, calling the size-weighted update UPGMA and the simple average WPGMA.

Ultrametric Trees
- Key feature: the distance to the root is the same for every leaf node (distance ≈ time since divergence).
- Given a tree with positive edge weights $\rho(i, j)$: if the value $d_{i,j}$ of the distance function between leaves $i$ and $j$ is the sum of the edge weights along the path connecting $i$ and $j$, then the distance function $d$ is an additive metric.
- If, in addition, the path length from the root $r$ to every leaf is identical, then $d$ is called an ultrametric.

Ultrametric Trees and UPGMA
- If the distances between taxa are an ultrametric (for some tree), then UPGMA always correctly reconstructs the original topology.
- If the distances are not an ultrametric (but are still additive), then UPGMA can yield a tree with incorrect topology and incorrect branch lengths.
- One method of fixing this problem is the Farris Transformed Distance Method.

Tree Topology
- The topology of a phylogenetic tree on $n$ taxa is an unweighted tree $t = (V, E)$ whose leaves are the $n$ taxa.
- Associated with a given topology $t$, we can define a weighted tree $t(d_1, \ldots, d_k)$, where $d_j$ is the length of the edge connecting node $j$ to its parent.

The Farris Transformation
Let $T$ be a tree with root $r$ and leaves $1, \ldots, n$, and define
  $d_{i,j}$ = length of the path connecting $i$ and $j$
(note that $d$ is additive). Define new distances
  \[ e_{i,j} = \frac{d_{i,j} - d_{i,r} - d_{j,r}}{2} + \bar{d}_r, \]
where $\bar{d}_r$ is the average distance between $r$ and the leaves.
Theorem: UPGMA applied to the transformed distances generates the correct topology of $T$.
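The clustering loop and the Farris transformation described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `pair_group_cluster`, `farris_transform`, and `size_weighted` are hypothetical, ties are broken by taking the first minimal pair, and the matrix is the five-taxon additive metric used in the example that follows. The root distances 2 and 9 (for a and b) appear in that example; the values 4, 12, and 9 (for c, d, and e) are an assumption inferred from its transformed matrix.

```python
from itertools import combinations

def pair_group_cluster(D, size_weighted=False):
    """Agglomerative clustering on a symmetric distance matrix D.

    size_weighted=False uses the simple average of step (d) under "UPGMA";
    size_weighted=True uses the |c|, |d| weighted average under "WPGMA".
    Returns a list of (members_of_new_cluster, node_height) in merge order.
    """
    n = len(D)
    members = {i: frozenset([i]) for i in range(n)}          # step 1
    dist = {frozenset(p): D[p[0]][p[1]]                       # step 2
            for p in combinations(range(n), 2)}
    merges = []
    for e in range(n, 2 * n - 1):                             # step 3: n-1 merges
        # (a) closest pair of current clusters (first minimal pair on ties)
        c, d = min(combinations(sorted(members), 2),
                   key=lambda p: dist[frozenset(p)])
        dmin = dist[frozenset((c, d))]
        # (d) distances from the new cluster e to every remaining cluster f
        wc, wd = (len(members[c]), len(members[d])) if size_weighted else (1, 1)
        for f in members:
            if f != c and f != d:
                dist[frozenset((e, f))] = (
                    wc * dist[frozenset((c, f))] + wd * dist[frozenset((d, f))]
                ) / (wc + wd)
        # (b), (c) replace c and d by e; the new node sits at height dmin / 2
        members[e] = members.pop(c) | members.pop(d)
        merges.append((members[e], dmin / 2))
    return merges

def farris_transform(D, root_dist):
    """e_ij = (d_ij - d_ir - d_jr) / 2 + mean root distance, with e_ii = 0."""
    n = len(D)
    mean_r = sum(root_dist) / n
    return [[0.0 if i == j
             else (D[i][j] - root_dist[i] - root_dist[j]) / 2 + mean_r
             for j in range(n)] for i in range(n)]

# Taxa a, b, c, d, e are indexed 0..4.
D = [[0, 9, 6, 14, 11],
     [9, 0, 13, 21, 18],
     [6, 13, 0, 12, 11],
     [14, 21, 12, 0, 19],
     [11, 18, 11, 19, 0]]
root_dist = [2, 9, 4, 12, 9]   # assumed distances from the root to a..e

raw_merges = pair_group_cluster(D)
farris_merges = pair_group_cluster(farris_transform(D, root_dist))
```

On this matrix the run on the raw distances joins a and c first, whereas the run on the transformed distances produces the clusters {a, b}, {c, d}, and {c, d, e}.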
Example
Additive metric:

       a    b    c    d    e
  a    0    9    6   14   11
  b    9    0   13   21   18
  c    6   13    0   12   11
  d   14   21   12    0   19
  e   11   18   11   19    0

[Figure: original tree on the leaves a, b, c, d, e, with edge lengths 1, 1, 8, 1, 8, 1, 2, 10; in Newick notation, ((a:1, b:8):1, ((c:2, d:10):1, e:8):1).]

Reconstructed Tree
[Figure: UPGMA tree (leaf order a, c, b, e, d; node labels 3, 6.5, 7.25, 9) next to the original tree.]
UPGMA first joins a and c (distance 6), which are not neighbors in the original tree.

Transformed distances
$\bar{d}_r = 7.2$.
  \[ e_{1,2} = \frac{d_{1,2} - d_{1,r} - d_{2,r}}{2} + \bar{d}_r = \frac{9 - 2 - 9}{2} + 7.2 = 6.2. \]
  \[ \begin{bmatrix}
  0   & 6.2 & 7.2 & 7.2 & 7.2 \\
  6.2 & 0   & 7.2 & 7.2 & 7.2 \\
  7.2 & 7.2 & 0   & 5.2 & 6.2 \\
  7.2 & 7.2 & 5.2 & 0   & 6.2 \\
  7.2 & 7.2 & 6.2 & 6.2 & 0
  \end{bmatrix} \]

Resulting Tree from Farris Transformation
[Figure: Farris transformed tree (leaf order a, b, c, d, e; node labels 2.6, 3.1, 3.1, 3.6) next to the original tree.]

But r isn't known
- The Farris transformation assumes that we know $d_{i,r}$. What do we do if we don't know $r$?
  - Take $r$ to be a known outgroup (i.e., a taxon far away from all the others).
  - Determine the outgroup to be the taxon whose average distance from all others is maximum.
- Without knowing the root, the algorithm may or may not capture the correct topology.

Neighbor Joining
- Main idea: join neighbors in such a way that the resulting tree has the smallest possible total branch length.
- Starts with a star-like tree.
- At each iteration:
  - Search all possible pairs of neighbors to find the pair of nodes whose joining results in the smallest total branch length for the overall tree.
  - Join this pair of neighbors together.

Maximum Likelihood Method
Given
- $n$ DNA sequences
- an evolutionary model (governing substitution rates)
find the phylogenetic tree with maximum likelihood:
- determine the tree topology
- determine the branch lengths

Overview of ML method
- Start with an initial tree topology for a small subset of the taxa.
- Use the maximum-likelihood method to determine optimal branch lengths.
- Make local changes to the topology and re-optimize the branch lengths.
- Add new taxa one by one.

Tree Likelihood
  \[ L(\text{tree}) = \Pr(\text{data} \mid \text{tree}). \]
[Figure: root 0 with children 4 (edge length $d_4$) and 3 (edge length $d_3$); node 4 has children 1 (edge length $d_1$) and 2 (edge length $d_2$); the leaves are labeled with the sequences AT, CT, and CG.]
How would we calculate the likelihood for this tree?

Components of Likelihood Calculation
- For each possible assignment of strings to internal nodes, calculate the probability of generating the strings at each node of the tree:
  (prior distribution for the root node) × (probability of the rest of the strings given the root).
- Sum over all possible assignments for the internal nodes.

Simplifications
- Allow only substitutions (no insertions or deletions).
- All strings have the same length.
- Assume each string position is independent of the other positions.
[Figure: node 3 with children 1 and 2 on edges of length $d_1$ and $d_2$.]
  \[ \Pr(a^1 = \mathrm{CA}, a^2 = \mathrm{AT}, a^3 = \mathrm{CG} \mid \text{tree}) = \Pr(a^1_1 = \mathrm{C}, a^2_1 = \mathrm{A}, a^3_1 = \mathrm{C} \mid \text{tree}) \times \Pr(a^1_2 = \mathrm{A}, a^2_2 = \mathrm{T}, a^3_2 = \mathrm{G} \mid \text{tree}) \]

Example
Consider the tree
[Figure: root 0 (state C) with children 4 (state A, edge length $d_4$) and 3 (state T, edge length $d_3$); node 4 has children 1 (state T, edge length $d_1$) and 2 (state G, edge length $d_2$).]
  \[ \Pr(a_1 = \mathrm{T}, a_2 = \mathrm{G}, a_3 = \mathrm{T}, a_4 = \mathrm{A}, a_0 = \mathrm{C} \mid t(d_1, d_2, d_3, d_4)) = \pi_{\mathrm{C}}\, p_{\mathrm{C},\mathrm{A}}(d_4)\, p_{\mathrm{A},\mathrm{T}}(d_1)\, p_{\mathrm{A},\mathrm{G}}(d_2)\, p_{\mathrm{C},\mathrm{T}}(d_3) \]
  \[ \Pr(a_1 = \mathrm{T}, a_2 = \mathrm{G}, a_3 = \mathrm{T} \mid t(d_1, d_2, d_3, d_4)) = \sum_{s_0, s_4} \pi_{s_0}\, p_{s_0,s_4}(d_4)\, p_{s_4,\mathrm{T}}(d_1)\, p_{s_4,\mathrm{G}}(d_2)\, p_{s_0,\mathrm{T}}(d_3) \]

Recursive Definition for the Likelihood
Define $L_{k,s}$ = conditional likelihood of the subtree rooted at $k$, given that node $k$ has state $s$.
- At leaves $i$:
  \[ L_{i,s} = \begin{cases} 1 & \text{if the $i$th taxon has $s$ at this site} \\ 0 & \text{otherwise} \end{cases} \]
- If $k$ is the parent of $i$ and $j$, then
  \[ L_{k,s_k} = \Bigl( \sum_{s_i} p_{s_k,s_i}(d_i)\, L_{i,s_i} \Bigr) \Bigl( \sum_{s_j} p_{s_k,s_j}(d_j)\, L_{j,s_j} \Bigr) \]
  and
  \[ L = \sum_{s_0} \pi_{s_0} L_{0,s_0}, \]
  where 0 is the root node.

Finding Optimal Branch Lengths
Felsenstein's Method:
- Optimize one distance at a time (leaving the others fixed).
- Cycle through all arcs until convergence.
- The solution is not necessarily a (local) optimum.
- Requires that the evolutionary process is a reversible Markov process.
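The recursive definition of the likelihood can be checked against the explicit double sum from the example with a short Python sketch. Several pieces are assumptions that the slides do not fix: the substitution model is taken to be Jukes-Cantor with uniform $\pi$ (the slides only require some model $p_{x,y}(d)$), the branch lengths are arbitrary illustrative values, the leaf sequences are read from the figure as 1 = CT, 2 = CG, 3 = AT, and the names `p`, `cond_lik`, `site_likelihood`, and `tree_for_site` are hypothetical.

```python
import math

STATES = "ACGT"
PI = {s: 0.25 for s in STATES}          # assumed uniform prior at the root

def p(x, y, d):
    """Jukes-Cantor probability of x -> y along a branch of length d (assumed model)."""
    decay = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay

def cond_lik(node):
    """Return {s: L_{k,s}} for the subtree rooted at node.

    A leaf is a one-letter string (its observed state at this site);
    an internal node is a pair ((child_i, d_i), (child_j, d_j)).
    """
    if isinstance(node, str):
        return {s: 1.0 if s == node else 0.0 for s in STATES}
    (child_i, d_i), (child_j, d_j) = node
    L_i, L_j = cond_lik(child_i), cond_lik(child_j)
    return {s: sum(p(s, x, d_i) * L_i[x] for x in STATES)
               * sum(p(s, y, d_j) * L_j[y] for y in STATES)
            for s in STATES}

def site_likelihood(root):
    """L = sum over root states s0 of pi_{s0} * L_{0,s0}."""
    L0 = cond_lik(root)
    return sum(PI[s] * L0[s] for s in STATES)

# Arbitrary illustrative branch lengths (not taken from the slides).
d1, d2, d3, d4 = 0.1, 0.2, 0.3, 0.4

def tree_for_site(s1, s2, s3):
    """Root 0 -> (node 4, leaf 3); node 4 -> (leaf 1, leaf 2)."""
    node4 = ((s1, d1), (s2, d2))
    return ((node4, d4), (s3, d3))

# Recursive value for the site with leaf states T, G, T ...
pruned = site_likelihood(tree_for_site("T", "G", "T"))
# ... and the explicit sum over the states s0, s4 of the internal nodes.
brute = sum(PI[s0] * p(s0, s4, d4) * p(s4, "T", d1) * p(s4, "G", d2) * p(s0, "T", d3)
            for s0 in STATES for s4 in STATES)

# Independent sites: the likelihood of whole sequences is a product over columns.
seqs = ("CT", "CG", "AT")               # assumed reading of the figure: leaves 1, 2, 3
full = math.prod(site_likelihood(tree_for_site(*col)) for col in zip(*seqs))
```

The recursion visits each node once per state, so its cost grows linearly with the number of nodes, whereas the explicit sum enumerates every joint assignment of states to the internal nodes.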

