Math/C SC 5610 Computational Biology Lecture 12: Phylogenetics
Stephen Billups
University of Colorado at Denver
Announcements
- Project Guidelines and Ideas are posted (proposal due March 8).
- CCB Seminar, Friday (Mar. 4)
  Speaker: Jack Horner, SAIC
  Title: Phylogenetic Methods for Characterizing the Signature of Stage I Ovarian Cancer in Serum Protein Mas
  Time: 11-12 (followed by lunch)
  Place: Media Center, AU008
Outline
- Distance based methods for phylogenetics
  - UPGMA
  - WPGMA
  - Neighbor-Joining
- Character based methods
  - Maximum Likelihood
  - Maximum Parsimony
Review: Distance Based Clustering Methods
Main Idea:
- Requires a distance matrix D (defining distances between each pair of elements).
- Repeatedly group together the closest elements.
- Different algorithms differ in how they treat distances between groups:
  - UPGMA (unweighted pair group method with arithmetic mean).
  - WPGMA (weighted pair group method with arithmetic mean).
UPGMA
1. Initialize C to the n singleton clusters {1}, ..., {n}.
2. Initialize dist(c, d) on C by defining
   $$\mathrm{dist}(\{i\}, \{j\}) = D(i, j).$$
3. Repeat n − 1 times:
   (a) determine a pair c, d of clusters in C such that dist(c, d) is minimal; define $d_{\min} = \mathrm{dist}(c, d)$.
   (b) define the new cluster $e = c \cup d$; update $C = (C \setminus \{c, d\}) \cup \{e\}$.
   (c) define a node with label e and daughters c, d, where e has distance $d_{\min}/2$ to its leaves.
   (d) define for all $f \in C$ with $f \neq e$,
   $$\mathrm{dist}(e, f) = \mathrm{dist}(f, e) = \frac{\mathrm{dist}(c, f) + \mathrm{dist}(d, f)}{2}.$$
   (avg. of previous distances)
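The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `upgma_simple_average`, the use of frozensets as cluster labels, and the tie-breaking rule (first minimal pair in list order) are assumptions made here, not part of the slides.

```python
from itertools import combinations

def upgma_simple_average(D, labels):
    """Cluster with the simple-average update rule from step 3(d).

    D is a symmetric list-of-lists distance matrix; labels names the taxa.
    Returns a list of merges (cluster_c, cluster_d, node_height).
    """
    # Clusters are frozensets of labels; dist maps unordered cluster pairs.
    clusters = [frozenset([name]) for name in labels]
    dist = {}
    for i, j in combinations(range(len(labels)), 2):
        dist[frozenset([clusters[i], clusters[j]])] = D[i][j]

    merges = []
    while len(clusters) > 1:
        # (a) closest pair; ties go to the first pair in list order (assumption).
        c, d = min(combinations(clusters, 2),
                   key=lambda pair: dist[frozenset(pair)])
        dmin = dist[frozenset([c, d])]
        # (b) replace c and d by their union e.
        e = c | d
        rest = [f for f in clusters if f not in (c, d)]
        # (d) distance from e to every other cluster f: plain average.
        for f in rest:
            dist[frozenset([e, f])] = (dist[frozenset([c, f])]
                                       + dist[frozenset([d, f])]) / 2
        clusters = rest + [e]
        # (c) the new node sits at height dmin / 2 above its leaves.
        merges.append((set(c), set(d), dmin / 2))
    return merges
```

Each returned triple records one pass through step 3, so a distance matrix on n taxa yields n − 1 merges. When several pairs share the minimal distance, a different tie-breaking rule can produce a different tree.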
Example
WPGMA
1. Initialize C to the n singleton clusters {1}, ..., {n}.
2. Initialize dist(c, d) on C by defining
   $$\mathrm{dist}(\{i\}, \{j\}) = D(i, j).$$
3. Repeat n − 1 times:
   (a) determine a pair c, d of clusters in C such that dist(c, d) is minimal; define $d_{\min} = \mathrm{dist}(c, d)$.
   (b) define the new cluster $e = c \cup d$; update $C = (C \setminus \{c, d\}) \cup \{e\}$.
   (c) define a node with label e and daughters c, d, where e has distance $d_{\min}/2$ to its leaves.
   (d) define for all $f \in C$ with $f \neq e$,
   $$\mathrm{dist}(e, f) = \mathrm{dist}(f, e) = \frac{|c|\,\mathrm{dist}(c, f) + |d|\,\mathrm{dist}(d, f)}{|c| + |d|}.$$
   (weighted avg.)
Note: many references attach the two names the other way round, calling the size-weighted average above UPGMA and the simple average WPGMA.
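The two update rules differ only in step 3(d). A small sketch makes the difference concrete; the function names `simple_average` and `size_weighted_average` are hypothetical, chosen here for illustration.

```python
def simple_average(dist_cf, dist_df):
    """Plain mean of the two previous distances (ignores cluster sizes)."""
    return (dist_cf + dist_df) / 2

def size_weighted_average(dist_cf, dist_df, size_c, size_d):
    """Mean weighted by the number of taxa in clusters c and d."""
    return (size_c * dist_cf + size_d * dist_df) / (size_c + size_d)

# Cluster c holds 3 taxa at distance 10 from f; cluster d holds 1 taxon at distance 20.
plain = simple_average(10, 20)
weighted = size_weighted_average(10, 20, size_c=3, size_d=1)
```

With these inputs the plain mean is 15 and the size-weighted mean is 12.5, since the larger cluster pulls the result toward its own distance. The two rules agree whenever |c| = |d|.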
Ultrametric Trees
Key feature: the distance to the root is the same for every leaf node (distance corresponds to time since divergence).
Given a tree with positive weights on its edges (i, j), if the value $d_{i,j}$ of the distance function between leaves i and j is the sum of the edge weights along the path connecting i and j, then the distance function d is an additive metric. If, in addition, the path length from the root r to every leaf is identical, then the distance function d is called an ultrametric.
Ultrametric Trees and UPGMA
If the distances between taxa are an ultrametric (for some tree), then UPGMA always correctly constructs the original topology.
If the distances are not an ultrametric (but are still additive), then UPGMA can yield a tree with incorrect topology and incorrect branch lengths. One method of fixing this problem is the Farris Transformed Distance Method.
Tree Topology
The topology of a phylogenetic tree on n taxa is an unweighted tree t = (V, E) whose leaves are the n taxa. Associated with a given topology t, we can define a weighted tree $t(d_1, \ldots, d_k)$, where $d_j$ is the length of the edge connecting node j to its parent.
The Farris Transformation
Let T be a tree with root r and leaves 1, ..., n, and define
$$d_{i,j} = \text{length of the path connecting } i \text{ and } j.$$
(Note that d is additive.) Define new distances
$$e_{i,j} = \frac{d_{i,j} - d_{i,r} - d_{j,r}}{2} + \bar{d}_r,$$
where $\bar{d}_r$ is the average distance between r and the leaves.
Theorem: UPGMA applied to the transformed distances generates the correct topology of T.
Example
Additive Metric:
       a    b    c    d    e
  a    0    9    6   14   11
  b    9    0   13   21   18
  c    6   13    0   12   11
  d   14   21   12    0   19
  e   11   18   11   19    0
[Figure: the original tree on leaves a, b, c, d, e, with edge weights 1, 1, 1, 1, 8, 8, 2, 10.]
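One way to check that a matrix like this is additive is the four-point condition: for every four taxa, the two largest of the three pairwise sums must be equal. This criterion is a standard characterization that is not stated on the slides. The sketch below assumes a symmetric matrix with zero diagonal, and the function name `is_additive` is hypothetical.

```python
from itertools import combinations

def is_additive(D, tol=1e-9):
    """Four-point test on a symmetric distance matrix (list of lists)."""
    n = len(D)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([D[i][j] + D[k][l],
                       D[i][k] + D[j][l],
                       D[i][l] + D[j][k]])
        # The two largest of the three pairwise sums must coincide.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

# Distance matrix from the example slide, taxa in the order a, b, c, d, e.
D = [[0, 9, 6, 14, 11],
     [9, 0, 13, 21, 18],
     [6, 13, 0, 12, 11],
     [14, 21, 12, 0, 19],
     [11, 18, 11, 19, 0]]
additive = is_additive(D)
```

The smallest of the three sums also identifies which two pairs are separated by the internal edge of the quartet, which is why the condition is useful for recovering topology.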
Reconstructed Tree
[Figure: UPGMA tree (leaf order a, c, b, e, d; node labels 3, 6.5, 7.25, 9) shown next to the original tree (edge weights 1, 1, 1, 1, 8, 8, 2, 10).]
Transformed distances
$$\bar{d}_r = 7.2.$$
$$e_{1,2} = \frac{d_{1,2} - d_{1,r} - d_{2,r}}{2} + \bar{d}_r = \frac{9 - 2 - 9}{2} + 7.2 = 6.2.$$
Transformed distance matrix (taxa in the order a, b, c, d, e):
       a     b     c     d     e
  a    0    6.2   7.2   7.2   7.2
  b   6.2    0    7.2   7.2   7.2
  c   7.2   7.2    0    5.2   6.2
  d   7.2   7.2   5.2    0    6.2
  e   7.2   7.2   6.2   6.2    0
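The transformed matrix can be reproduced with a short sketch. The slides state only the root distances 2 (taxon 1) and 9 (taxon 2) and the mean 7.2. The remaining root distances 4, 12, 9 used below are an assumption, inferred here from the transformed values, and the function name `farris_transform` is hypothetical.

```python
def farris_transform(D, root_dist):
    """Apply e_ij = (d_ij - d_ir - d_jr) / 2 + mean(d_r) to every pair i != j."""
    n = len(D)
    mean_root = sum(root_dist) / n
    E = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                E[i][j] = (D[i][j] - root_dist[i] - root_dist[j]) / 2 + mean_root
    return E

# Original additive distances, taxa in the order a, b, c, d, e.
D = [[0, 9, 6, 14, 11],
     [9, 0, 13, 21, 18],
     [6, 13, 0, 12, 11],
     [14, 21, 12, 0, 19],
     [11, 18, 11, 19, 0]]
# Root-to-leaf distances; the last three are inferred, not given on the slides.
root_dist = [2, 9, 4, 12, 9]
E = farris_transform(D, root_dist)
```

Because the mean root distance is added to every off-diagonal entry, the transformation shifts all distances by the same constant after removing each pair's root offsets, which is what makes the result suitable for UPGMA.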
Resulting Tree from Farris Transformation
[Figure: Farris transformed tree (leaf order a, b, c, d, e; node labels 2.6, 3.1, 3.1, 3.6) shown next to the original tree (edge weights 1, 1, 1, 1, 8, 8, 2, 10).]
But r isn’t known
The Farris transformation assumes that we know $d_{i,r}$. What do we do if we don’t know r?
- Take r to be a known outgroup (i.e., a taxon far away from all the others).
- Determine the outgroup to be the taxon whose average distance from all others is maximum.
Without knowing the root, the algorithm may or may not capture the correct topology.
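The second option, choosing the taxon with the largest average distance to all others, can be sketched as follows. The function name `pick_outgroup` is hypothetical, and the matrix is the additive example used earlier in the lecture.

```python
def pick_outgroup(D, labels):
    """Return the label with maximum mean distance to the other taxa, plus all means."""
    n = len(D)
    avg = [sum(D[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)]
    best = max(range(n), key=lambda i: avg[i])
    return labels[best], avg

D = [[0, 9, 6, 14, 11],
     [9, 0, 13, 21, 18],
     [6, 13, 0, 12, 11],
     [14, 21, 12, 0, 19],
     [11, 18, 11, 19, 0]]
outgroup, averages = pick_outgroup(D, ["a", "b", "c", "d", "e"])
```

For this matrix the averages are 10, 15.25, 10.5, 16.5 and 14.75, so taxon d is selected. As the slide notes, a root chosen this way is only a heuristic and need not recover the correct topology.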
Neighbor Joining
Main Idea: Join neighbors in such a way that a tree is created with the smallest possible branch length overall.
- Starts with a star-like tree.
- At each iteration:
  - Search all possible pairs of neighbors to find the pair of nodes whose joining results in the smallest total branch length for the overall tree.
  - Join this pair of neighbors together.
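The pair search in one iteration can be sketched with the commonly used Q-criterion, $Q(i,j) = (n-2)\,d_{i,j} - \sum_k d_{i,k} - \sum_k d_{j,k}$, whose minimum identifies the pair to join. This formula is the standard formulation from the neighbor-joining literature and does not appear on the slides, so treat it as an assumption here. The function name `nj_pick_pair` is hypothetical.

```python
from itertools import combinations

def nj_pick_pair(D):
    """Return the index pair (i, j) minimizing the Q-criterion for matrix D."""
    n = len(D)
    row_sum = [sum(row) for row in D]

    def q(i, j):
        return (n - 2) * D[i][j] - row_sum[i] - row_sum[j]

    return min(combinations(range(n), 2), key=lambda p: q(*p))

# Additive example matrix from earlier in the lecture (order a, b, c, d, e).
D = [[0, 9, 6, 14, 11],
     [9, 0, 13, 21, 18],
     [6, 13, 0, 12, 11],
     [14, 21, 12, 0, 19],
     [11, 18, 11, 19, 0]]
pair = nj_pick_pair(D)
```

On this matrix the minimum is Q = −74 at indices (0, 1), i.e. taxa a and b, even though the smallest raw distance is between a and c. A full implementation would then replace the pair by a new node, update the distances, and repeat.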
Maximum Likelihood Method
Given
- n DNA sequences
- An evolutionary model (governing substitution rates)
Find the phylogenetic tree with maximum likelihood:
- Determine tree topology
- Determine branch lengths
Overview of ML method
- Start with an initial tree topology for a small subset of the taxa.
- Use the maximum-likelihood method to determine optimal branch lengths.
- Make local changes to the topology and re-optimize branch lengths.
- Add new taxa one by one.
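Tree Likelihood
$$L(\text{tree}) = \Pr(\text{data} \mid \text{tree}).$$
[Figure: rooted tree with root 0, whose children are node 4 (edge length $d_4$) and leaf 3 (edge length $d_3$); node 4 has leaf children 1 and 2 (edge lengths $d_1$ and $d_2$). Leaf sequences: 1 = CT, 2 = CG, 3 = AT.]
How would we calculate the likelihood for this tree?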
Components of Likelihood Calculation
- For each possible assignment of strings to internal nodes, calculate the probability of generating the strings at each node of the tree:
  (prior distribution for root node) × (probability of the rest of the strings given the root).
- Sum over all possible assignments for internal nodes.
Simplifications
- Allow only substitutions (no inserts or deletes).
- All strings have the same length.
- Assume each string position is independent of the other positions.
[Figure: node 3 with children 1 and 2, joined by edges of length $d_1$ and $d_2$.]
$$\Pr(a^1 = CA,\, a^2 = AT,\, a^3 = CG \mid \text{tree}) = \Pr(a^1_1 = C,\, a^2_1 = A,\, a^3_1 = C \mid \text{tree}) \cdot \Pr(a^1_2 = A,\, a^2_2 = T,\, a^3_2 = G \mid \text{tree})$$
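Example
Consider the tree
[Figure: the same rooted tree, shown for one site. Root 0 has state C; node 4 (edge $d_4$) has state A; leaf 3 (edge $d_3$) has state T; leaves 1 and 2 (edges $d_1$, $d_2$ below node 4) have states T and G.]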
$$\Pr(a_1 = T, a_2 = G, a_3 = T, a_4 = A, a_0 = C \mid t(d_1, d_2, d_3, d_4)) = \pi_C\, p_{C,A}(d_4)\, p_{A,T}(d_1)\, p_{A,G}(d_2)\, p_{C,T}(d_3)$$
$$\Pr(a_1 = T, a_2 = G, a_3 = T \mid t(d_1, d_2, d_3, d_4)) = \sum_{s_0, s_4} \pi_{s_0}\, p_{s_0,s_4}(d_4)\, p_{s_4,T}(d_1)\, p_{s_4,G}(d_2)\, p_{s_0,T}(d_3)$$
Here $\pi_s$ denotes the prior probability of state s at the root.
Recursive Definition for the Likelihood
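Define $L_{k,s}$ = conditional likelihood of the subtree rooted at k, given that node k has state s.
At leaves i,
$$L_{i,s} = \begin{cases} 1 & \text{if the } i\text{th taxon has } s \text{ at this site} \\ 0 & \text{otherwise} \end{cases}$$
If k is the parent of i and j, then
$$L_{k,s_k} = \left( \sum_{s_i} p_{s_k,s_i}(d_i)\, L_{i,s_i} \right) \left( \sum_{s_j} p_{s_k,s_j}(d_j)\, L_{j,s_j} \right)$$
and $L = \sum_{s_0} \pi_{s_0} L_{0,s_0}$, where 0 is the root node.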
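The recursion can be sketched for a single site and compared against the explicit sum over internal states from the preceding example. Several things here are assumptions for illustration only: the Jukes-Cantor substitution model, the uniform root prior of 1/4, the branch lengths, and all function names. The slides do not fix a particular model.

```python
import math
from itertools import product

STATES = "ACGT"

def jc_prob(a, b, d):
    """Jukes-Cantor probability of a -> b over branch length d (assumed model)."""
    e = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def cond_lik(node, tree, leaf_state):
    """Return {s: L_{node,s}} for every state s, following the recursion."""
    if node in leaf_state:
        # Leaf: 1 for the observed state, 0 otherwise.
        return {s: 1.0 if s == leaf_state[node] else 0.0 for s in STATES}
    result = {s: 1.0 for s in STATES}
    for child, d in tree[node]:
        child_lik = cond_lik(child, tree, leaf_state)
        for s in STATES:
            # One factor per child: sum over the child's possible states.
            result[s] *= sum(jc_prob(s, t, d) * child_lik[t] for t in STATES)
    return result

def site_likelihood(tree, root, leaf_state, prior=0.25):
    """Combine root conditional likelihoods with a uniform root prior."""
    root_lik = cond_lik(root, tree, leaf_state)
    return sum(prior * root_lik[s] for s in STATES)

# Tree from the example: root 0 -> (4, 3), node 4 -> (1, 2).
# Branch lengths are made-up values for illustration.
d1, d2, d3, d4 = 0.1, 0.2, 0.3, 0.4
tree = {0: [(4, d4), (3, d3)], 4: [(1, d1), (2, d2)]}
leaves = {1: "T", 2: "G", 3: "T"}
pruned = site_likelihood(tree, 0, leaves)

# Explicit double sum over (s0, s4) for comparison.
brute = sum(0.25 * jc_prob(s0, s4, d4) * jc_prob(s4, "T", d1)
            * jc_prob(s4, "G", d2) * jc_prob(s0, "T", d3)
            for s0, s4 in product(STATES, repeat=2))
```

Both computations should agree up to floating-point rounding. The recursion visits each node once, whereas the explicit sum grows exponentially with the number of internal nodes. Under the site-independence assumption, the likelihood of full sequences is the product of such per-site values.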
Finding Optimal Branch Lengths
Felsenstein’s Method:
- Optimize one distance at a time (leaving the others fixed).
- Cycle through all arcs until convergence.
- The solution is not necessarily a (local) optimum.
- Requires that the evolutionary process is a reversible Markov process.