
Math/C SC 5610 Computational Biology Lecture 12:

Stephen Billups

University of Colorado at Denver

Announcements

- Project Guidelines and Ideas are posted (proposal due March 8).
- CCB Seminar, Friday (Mar. 4)
  Speaker: Jack Horner, SAIC
  Title: Phylogenetic Methods for Characterizing the Signature of Stage I Ovarian Cancer in Serum Mas
  Time: 11-12 (followed by lunch)
  Place: Media Center, AU008

Outline

- Distance based methods for phylogenetics
  - UPGMA
  - WPGMA
  - Neighbor-Joining
- Character based methods
  - Maximum Likelihood
  - Maximum Parsimony

Review: Distance Based Clustering Methods

Main Idea:
- Requires a distance matrix D (defining distances between each pair of elements).
- Repeatedly group together the closest elements.
- Different algorithms differ in how they treat distances between groups:
  - UPGMA (unweighted pair group method with arithmetic mean)
  - WPGMA (weighted pair group method with arithmetic mean)

UPGMA

1. Initialize C to the n singleton clusters {1}, ..., {n}.
2. Initialize dist(c, d) on C by defining

   $$\mathrm{dist}(\{i\}, \{j\}) = D(i, j).$$

3. Repeat n − 1 times:
   (a) Determine the pair c, d of clusters in C such that dist(c, d) is minimal; define $d_{\min} = \mathrm{dist}(c, d)$.
   (b) Define the new cluster $e = c \cup d$; update $C = (C \setminus \{c, d\}) \cup \{e\}$.
   (c) Define a node with label e and daughters c, d, where e has distance $d_{\min}/2$ to its leaves.
   (d) Define, for all $f \in C$ with $f \neq e$,

   $$\mathrm{dist}(e, f) = \mathrm{dist}(f, e) = \frac{\mathrm{dist}(c, f) + \mathrm{dist}(d, f)}{2} \quad \text{(average of previous distances)}$$
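The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `upgma_simple_average`, the 4-taxon matrix, and the tie-breaking rule (first minimal pair in list order) are assumptions made here. The update in step (d) is the simple average exactly as stated on this slide; some references attach the size-weighted average to the name UPGMA instead.

```python
from itertools import combinations

def upgma_simple_average(D):
    """Cluster using steps 1-3 above with the simple-average update.

    D is a symmetric distance matrix (list of lists). Returns a list of
    merges (c, d, height): c and d are frozensets of leaf indices and
    height = dmin / 2 is the distance from the new node to its leaves.
    """
    n = len(D)
    clusters = [frozenset([i]) for i in range(n)]          # step 1
    dist = {}                                              # step 2
    for i, j in combinations(range(n), 2):
        dist[frozenset((clusters[i], clusters[j]))] = D[i][j]
    merges = []
    for _ in range(n - 1):                                 # step 3
        # (a) closest pair; ties go to the first pair in list order (assumption)
        c, d = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        dmin = dist[frozenset((c, d))]
        # (b) replace c and d by their union e
        e = c | d
        rest = [f for f in clusters if f not in (c, d)]
        # (d) simple average of the two previous distances
        for f in rest:
            dist[frozenset((e, f))] = (
                dist[frozenset((c, f))] + dist[frozenset((d, f))]
            ) / 2
        clusters = rest + [e]
        # (c) record the new node and its height
        merges.append((c, d, dmin / 2))
    return merges

# Hypothetical 4-taxon matrix chosen for illustration only.
D = [
    [0, 2, 8, 8],
    [2, 0, 8, 8],
    [8, 8, 0, 4],
    [8, 8, 4, 0],
]
for c, d, height in upgma_simple_average(D):
    print(sorted(c), sorted(d), height)
```

On this toy matrix the three merges happen at heights 1.0, 2.0 and 4.0. Because ties are broken by list order, matrices with tied minimal distances can give different trees under a different ordering.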


WPGMA

1. Initialize C to the n singleton clusters {1}, ..., {n}.
2. Initialize dist(c, d) on C by defining

   $$\mathrm{dist}(\{i\}, \{j\}) = D(i, j).$$

3. Repeat n − 1 times:
   (a) Determine the pair c, d of clusters in C such that dist(c, d) is minimal; define $d_{\min} = \mathrm{dist}(c, d)$.
   (b) Define the new cluster $e = c \cup d$; update $C = (C \setminus \{c, d\}) \cup \{e\}$.
   (c) Define a node with label e and daughters c, d, where e has distance $d_{\min}/2$ to its leaves.
   (d) Define, for all $f \in C$ with $f \neq e$,

   $$\mathrm{dist}(e, f) = \mathrm{dist}(f, e) = \frac{|c|\,\mathrm{dist}(c, f) + |d|\,\mathrm{dist}(d, f)}{|c| + |d|} \quad \text{(weighted average)}$$
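The only difference from the previous algorithm is the update in step (d). A small sketch of that one step, with a hypothetical helper name `merged_distance` and made-up cluster sizes and distances, shows how the size-weighted rule differs from the simple average:

```python
def merged_distance(dist_cf, dist_df, size_c, size_d, size_weighted=True):
    """Distance from the merged cluster e = c U d to another cluster f.

    size_weighted=True uses the rule in step (d) above;
    size_weighted=False uses the simple average (dist_cf + dist_df) / 2.
    """
    if size_weighted:
        return (size_c * dist_cf + size_d * dist_df) / (size_c + size_d)
    return (dist_cf + dist_df) / 2

# Illustrative numbers: c holds 2 taxa, d holds 1 taxon.
weighted = merged_distance(13, 21, size_c=2, size_d=1)
simple = merged_distance(13, 21, size_c=2, size_d=1, size_weighted=False)
print(weighted, simple)
```

With |c| = 2 and |d| = 1 the size-weighted rule gives 47/3 (about 15.67) while the simple average gives 17. The two rules agree whenever |c| = |d|.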

Ultrametric Trees

Key feature: the distance to the root is the same for every leaf node (distance corresponds to time since divergence).

Given a tree with positive edge weights: if the value $d_{i,j}$ of the distance function between leaves i and j is the sum of the edge weights along the path connecting i and j, then the distance function d is an additive metric. If, in addition, the path length from the root r to every leaf is identical, then d is called an ultrametric.
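To make the two definitions concrete, here is a small sketch that computes path-length distances on a rooted tree and checks whether all root-to-leaf path lengths agree. The tree encoding (a `parent` map plus a `length` map giving the weight of the edge above each node), the function names, and the 3-leaf tree are all assumptions chosen for illustration.

```python
def root_distances(parent, length, leaves):
    """Path length from each leaf up to the root (the node without a parent)."""
    out = {}
    for leaf in leaves:
        node, total = leaf, 0.0
        while node in parent:
            total += length[node]
            node = parent[node]
        out[leaf] = total
    return out

def path_distance(parent, length, i, j):
    """Sum of the edge weights on the path connecting nodes i and j."""
    # Distance from i to each of its ancestors (including i itself).
    up = {i: 0.0}
    node, total = i, 0.0
    while node in parent:
        total += length[node]
        node = parent[node]
        up[node] = total
    # Climb from j until reaching a node that is also an ancestor of i.
    node, total = j, 0.0
    while node not in up:
        total += length[node]
        node = parent[node]
    return total + up[node]

def is_ultrametric_tree(parent, length, leaves, tol=1e-9):
    """True if every leaf has the same path length to the root."""
    depths = list(root_distances(parent, length, leaves).values())
    return max(depths) - min(depths) <= tol

# Hypothetical tree: root r has children x and c; x has children a and b.
parent = {"a": "x", "b": "x", "x": "r", "c": "r"}
length = {"a": 1.0, "b": 1.0, "x": 3.0, "c": 4.0}
leaves = ["a", "b", "c"]

print(path_distance(parent, length, "a", "b"))   # 2.0
print(path_distance(parent, length, "a", "c"))   # 8.0
print(is_ultrametric_tree(parent, length, leaves))  # True
```

Lengthening the edge above c from 4 to 6 keeps the distances additive (they are still path lengths on a tree) but breaks the ultrametric property, since c is then farther from the root than a and b.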

Ultrametric Trees and UPGMA

If the distances between taxa are an ultrametric (for some tree), then UPGMA always correctly constructs the original topology.

If the distances are not an ultrametric (but are still additive), then UPGMA may yield a tree with incorrect topology and incorrect branch lengths. One method of fixing this problem is the Farris Transformed Distance Method.

Tree Topology

The topology of a tree on n taxa is an unweighted tree t = (V, E) whose leaves are the n taxa. Associated with a given topology t, we can define a weighted tree $t(d_1, \ldots, d_k)$, where $d_j$ is the length of the edge connecting node j to its parent.

The Farris Transformation

Let T be a tree with root r and leaves 1, ..., n, and define

$$d_{i,j} = \text{length of the path connecting } i \text{ and } j.$$

(Note that d is additive.) Define new distances

$$e_{i,j} = \frac{d_{i,j} - d_{i,r} - d_{j,r}}{2} + \bar{d}_r,$$

where $\bar{d}_r$ is the average distance between r and the leaves.

Theorem: UPGMA applied to the transformed distances generates the correct topology of T.

Example

Additive Metric:

       a    b    c    d    e
  a    0    9    6   14   11
  b    9    0   13   21   18
  c    6   13    0   12   11
  d   14   21   12    0   19
  e   11   18   11   19    0

[Figure: the original tree on leaves a, b, c, d, e, with edge weights 1, 1, 1, 1, 8, 8, 2, 10.]

Reconstructed Tree

[Figure: UPGMA tree (leaves in the order a, c, b, e, d; internal nodes labeled 3, 6.5, 7.25, 9) next to the original tree (edge weights 1, 1, 1, 1, 8, 8, 2, 10).]

Transformed distances

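$$\bar{d}_r = 7.2.$$

$$e_{1,2} = \frac{d_{1,2} - d_{1,r} - d_{2,r}}{2} + \bar{d}_r = \frac{9 - 2 - 9}{2} + 7.2 = 6.2.$$

$$\begin{pmatrix}
0 & 6.2 & 7.2 & 7.2 & 7.2 \\
6.2 & 0 & 7.2 & 7.2 & 7.2 \\
7.2 & 7.2 & 0 & 5.2 & 6.2 \\
7.2 & 7.2 & 5.2 & 0 & 6.2 \\
7.2 & 7.2 & 6.2 & 6.2 & 0
\end{pmatrix}$$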
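The transformed matrix can be reproduced numerically with a short sketch. The function name `farris_transform` is an assumption. The root distances for a and b (2 and 9) appear in the worked value of e_{1,2}; the values used here for c, d and e (4, 12 and 9) are not printed on the slides and were back-computed from the transformed matrix, so treat them as an assumption. They are at least consistent with the stated average of 7.2.

```python
def farris_transform(D, root_dist):
    """Apply e_ij = (d_ij - d_ir - d_jr) / 2 + mean(d_r) to every pair i != j."""
    n = len(D)
    mean_root = sum(root_dist) / n
    E = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                E[i][j] = (D[i][j] - root_dist[i] - root_dist[j]) / 2 + mean_root
    return E

# Additive metric from the example (taxa a, b, c, d, e).
D = [
    [0, 9, 6, 14, 11],
    [9, 0, 13, 21, 18],
    [6, 13, 0, 12, 11],
    [14, 21, 12, 0, 19],
    [11, 18, 11, 19, 0],
]
# d_{a,r} and d_{b,r} are from the slide; the last three are inferred (assumption).
root_dist = [2, 9, 4, 12, 9]

for row in farris_transform(D, root_dist):
    print([round(x, 1) for x in row])
```

The smallest transformed distance is between c and d (5.2), so UPGMA on the transformed matrix joins c and d first, which it does not do on the original distances.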

Resulting Tree from Farris Transformation

[Figure: Farris transformed tree (leaves in the order a, b, c, d, e; internal nodes labeled 2.6, 3.1, 3.1, 3.6) next to the original tree (edge weights 1, 1, 1, 1, 8, 8, 2, 10).]

But r isn’t known

The Farris transformation assumes that we know $d_{i,r}$. What do we do if we don't know r?
- Take r to be a known outgroup (i.e., a taxon far away from all the others).
- Determine the outgroup to be the taxon whose average distance from all others is maximum.
Without knowing the root, the algorithm may or may not capture the correct topology.
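The second rule, picking the taxon with the largest average distance to all others, is easy to sketch. The function name `pick_outgroup` is hypothetical, and the matrix is the additive metric from the earlier example (taxa a, b, c, d, e).

```python
def pick_outgroup(D):
    """Return (index, averages): the taxon with the largest mean distance to the rest."""
    n = len(D)
    averages = [
        sum(D[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    best = max(range(n), key=lambda i: averages[i])
    return best, averages

taxa = ["a", "b", "c", "d", "e"]
D = [
    [0, 9, 6, 14, 11],
    [9, 0, 13, 21, 18],
    [6, 13, 0, 12, 11],
    [14, 21, 12, 0, 19],
    [11, 18, 11, 19, 0],
]
index, averages = pick_outgroup(D)
print(taxa[index], averages)
```

For this matrix the averages are 10, 15.25, 10.5, 16.5 and 14.75, so the rule selects d. It is a heuristic: the taxon it picks need not lie outside the remaining taxa in the true tree.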

Neighbor-Joining

Main Idea: Join neighbors in such a way that a tree is created with the smallest possible overall branch length.
- Start with a star-like tree.
- At each iteration:
  - Search all possible pairs of neighbors to find the pair of nodes whose joining results in the smallest total branch length for the overall tree.
  - Join this pair of neighbors together.
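One iteration of the pair search can be sketched as follows. The slide does not give a formula; this sketch uses the commonly cited Q-criterion, Q(i, j) = (n − 2) d(i, j) − r_i − r_j with r_i the row sum of the distance matrix, which is an assumption about the formulation intended here. The function name `nj_pick_pair` is hypothetical, and the matrix is the additive metric from the earlier example.

```python
def nj_pick_pair(D):
    """Pick the pair (i, j) minimizing Q(i, j) = (n - 2) * d_ij - r_i - r_j.

    Also returns the minimal Q value and the branch lengths from i and j
    to the new internal node that joins them.
    """
    n = len(D)
    r = [sum(row) for row in D]          # total distance from each taxon
    best, best_q = None, None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * D[i][j] - r[i] - r[j]
            if best_q is None or q < best_q:
                best, best_q = (i, j), q
    i, j = best
    length_i = D[i][j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
    length_j = D[i][j] - length_i
    return best, best_q, (length_i, length_j)

D = [
    [0, 9, 6, 14, 11],
    [9, 0, 13, 21, 18],
    [6, 13, 0, 12, 11],
    [14, 21, 12, 0, 19],
    [11, 18, 11, 19, 0],
]
pair, q, lengths = nj_pick_pair(D)
print(pair, q, lengths)
```

On this matrix the first join is taxa a and b (indices 0 and 1), with branch lengths 1 and 8 to their new parent. A full implementation would then replace the pair by the new node, recompute distances to it, and repeat until the tree is resolved.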

Maximum Likelihood Method

Given:
- n DNA sequences
- An evolutionary model (governing substitution rates)
Find the phylogenetic tree with maximum likelihood:
- Determine the tree topology.
- Determine the branch lengths.

Overview of ML method

- Start with an initial tree topology for a small subset of the taxa.
- Use the maximum-likelihood method to determine optimal branch lengths.
- Make local changes to the topology and re-optimize the branch lengths.
- Add new taxa one by one.

Tree Likelihood

$$L(\text{tree}) = \Pr(\text{data} \mid \text{tree}).$$

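[Figure: tree with root 0, whose children are node 4 (branch length d4) and leaf 3 (branch length d3); node 4 has children leaf 1 (d1) and leaf 2 (d2). The leaves carry the sequences CT (leaf 1), CG (leaf 2) and AT (leaf 3).]

How would we calculate the likelihood for this tree?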

Components of Likelihood Calculation

- For each possible assignment of strings to internal nodes, calculate the probability of generating the strings at each node of the tree:
  (prior distribution for the root node) × (probability of the rest of the strings given the root).
- Sum over all possible assignments for the internal nodes.

Simplifications

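- Allow only substitutions (no inserts or deletes).
- All strings have the same length.
- Assume each string position is independent of the other positions.

[Figure: tree in which node 3 is the parent of nodes 1 and 2, with branch lengths d1 and d2.]

$$\Pr(a^1 = \mathrm{CA},\ a^2 = \mathrm{AT},\ a^3 = \mathrm{CG} \mid \text{tree}) = \Pr(a^1_1 = \mathrm{C},\ a^2_1 = \mathrm{A},\ a^3_1 = \mathrm{C} \mid \text{tree}) \cdot \Pr(a^1_2 = \mathrm{A},\ a^2_2 = \mathrm{T},\ a^3_2 = \mathrm{G} \mid \text{tree})$$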

Example

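Consider the tree

[Figure: tree with root 0 (state C), whose children are node 4 (state A, branch length d4) and leaf 3 (state T, branch length d3); node 4 has children leaf 1 (state T, branch length d1) and leaf 2 (state G, branch length d2).]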

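$$\Pr(a_1 = T,\ a_2 = G,\ a_3 = T,\ a_4 = A,\ a_0 = C \mid t(d_1, d_2, d_3, d_4)) = \pi_C\, p_{C,A}(d_4)\, p_{A,T}(d_1)\, p_{A,G}(d_2)\, p_{C,T}(d_3)$$

$$\Pr(a_1 = T,\ a_2 = G,\ a_3 = T \mid t(d_1, d_2, d_3, d_4)) = \sum_{s_0, s_4} \pi_{s_0}\, p_{s_0,s_4}(d_4)\, p_{s_4,T}(d_1)\, p_{s_4,G}(d_2)\, p_{s_0,T}(d_3)$$

Here $\pi_s$ is the prior probability of state s at the root and $p_{s,t}(d)$ is the probability of a substitution from s to t along a branch of length d.

Recursive Definition for the Likelihood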

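Define $L_{k,s}$ = conditional likelihood of the subtree rooted at k, given that node k has state s.

At leaves i:

$$L_{i,s} = \begin{cases} 1 & \text{if the } i\text{th taxon has } s \text{ at this site} \\ 0 & \text{otherwise} \end{cases}$$

If k is the parent of i and j, then

$$L_{k,s_k} = \left( \sum_{s_i} p_{s_k,s_i}(d_i)\, L_{i,s_i} \right) \left( \sum_{s_j} p_{s_k,s_j}(d_j)\, L_{j,s_j} \right)$$

and

$$L = \sum_{s_0} \pi_{s_0}\, L_{0,s_0},$$

where 0 is the root node and $\pi_{s_0}$ is the prior probability of state $s_0$ at the root.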
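The recursion translates almost directly into code. The sketch below makes several assumptions that are not in the slides: it uses the Jukes-Cantor substitution model with a uniform prior of 1/4 at the root as a stand-in for the unspecified evolutionary model, it encodes the tree as a hypothetical `children` map (node to list of (child, branch length) pairs), and the branch lengths are invented for illustration. Site independence is used to multiply the per-site likelihoods.

```python
import math

STATES = "ACGT"

def jc69(s, t, d):
    """Jukes-Cantor probability of state t after a branch of length d, starting from s."""
    e = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * e if s == t else 0.25 - 0.25 * e

def conditional(node, children, leaf_state):
    """Return {s: L_{node,s}} for one site, following the recursion above."""
    if node not in children:  # leaf: indicator of the observed state
        return {s: 1.0 if leaf_state[node] == s else 0.0 for s in STATES}
    result = {s: 1.0 for s in STATES}
    for child, d in children[node]:
        below = conditional(child, children, leaf_state)
        for s in STATES:
            result[s] *= sum(jc69(s, t, d) * below[t] for t in STATES)
    return result

def site_likelihood(root, children, leaf_state):
    """L = sum over root states of prior * L_{root,s}, with a uniform prior (assumption)."""
    at_root = conditional(root, children, leaf_state)
    return sum(0.25 * at_root[s] for s in STATES)

def alignment_likelihood(root, children, seqs):
    """Product of site likelihoods, assuming sites evolve independently."""
    n_sites = len(next(iter(seqs.values())))
    total = 1.0
    for k in range(n_sites):
        column = {name: seq[k] for name, seq in seqs.items()}
        total *= site_likelihood(root, children, column)
    return total

# Tree shape from the example: root 0 -> (4, 3), node 4 -> (1, 2).
# Branch lengths are hypothetical.
children = {0: [(4, 0.1), (3, 0.3)], 4: [(1, 0.2), (2, 0.2)]}
seqs = {1: "CT", 2: "CG", 3: "AT"}
print(alignment_likelihood(0, children, seqs))
```

The recursion visits each node once per site, so the cost grows linearly with the number of nodes, whereas summing over every assignment of states to internal nodes grows exponentially with the number of internal nodes.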

Finding Optimal Branch Lengths

Felsenstein’s Method:
- Optimize one distance at a time (leaving the others fixed).
- Cycle through all arcs until convergence.
- The solution is not necessarily a (local) optimum.
- Requires that the evolutionary process is a reversible Markov process.
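The "one distance at a time" idea is a coordinate-wise search, which can be sketched generically. This is not Felsenstein's actual per-branch update: the sketch swaps in a golden-section line search for each branch, and the function names, bounds, tolerances and the toy objective standing in for a tree log-likelihood are all assumptions made for illustration.

```python
def golden_max(f, lo, hi, tol=1e-8):
    """Maximize a unimodal one-dimensional function on [lo, hi]."""
    g = (5 ** 0.5 - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        x1 = b - g * (b - a)
        x2 = a + g * (b - a)
        if f(x1) < f(x2):
            a = x1   # the maximum lies to the right of x1
        else:
            b = x2   # the maximum lies to the left of x2
    return (a + b) / 2

def optimize_branches(loglik, lengths, lo=1e-6, hi=10.0, tol=1e-8, max_cycles=100):
    """Cycle through the branches, optimizing one length with the others fixed."""
    lengths = list(lengths)
    best = loglik(lengths)
    for _ in range(max_cycles):
        for k in range(len(lengths)):
            def along_k(x, k=k):
                # log-likelihood as a function of branch k only
                return loglik(lengths[:k] + [x] + lengths[k + 1:])
            lengths[k] = golden_max(along_k, lo, hi, tol)
        new = loglik(lengths)
        if new - best < tol:   # stop once a full cycle no longer helps
            break
        best = new
    return lengths

# Toy stand-in for a log-likelihood with two coupled branch lengths;
# its maximum is at (1, 2).
def toy_loglik(d):
    u, v = d[0] - 1.0, d[1] - 2.0
    return -(u * u + v * v + 0.5 * u * v)

print(optimize_branches(toy_loglik, [0.5, 0.5]))
```

In practice `loglik` would evaluate the tree likelihood for a fixed topology, and each one-dimensional slice may have several local maxima, in which case the line search returns only one of them.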
