Lecture 10: Phylogeny 25,27/12/12 Phylogeny

Computational Genomics
Prof. Ron Shamir & Prof. Roded Sharan
School of Computer Science, Tel Aviv University
CG © Ron Shamir

Slides:
• Adi Akavia
• Nir Friedman's slides at HUJI (based on ALGMB 98)
• Anders Gorm Pedersen, Technical University of Denmark
Sources: Joe Felsenstein, "Inferring Phylogenies" (2004)

Phylogeny
• Phylogeny: the ancestral relationship of a set of species.
• Represented by a phylogenetic tree.
• Leaves - contemporary
• Internal nodes - ancestral
• Branch length - distance between sequences
[Figure: phylogenetic tree with a branch, a leaf, and an internal node marked]

"Classical" Phylogeny schools
Classical vs. Modern:
• Classical - morphological characters
• Modern - molecular sequences

Trees and Models
• rooted / unrooted
• topology / distance
• binary / general
(slide callout: "molecular clock")

To root or not to root?
• Unrooted tree: phylogeny without direction.

Rooting an Unrooted Tree
• We can estimate the position of the root by introducing an outgroup:
  – a species that is definitely most distant from all the species of interest
[Figure: unrooted tree with a proposed root; leaves Falcon, Aardvark, Bison, Chimp, Dog, Elephant]

[Figure from A. Scally et al., Nature 483, 169-175 (2012), doi:10.1038/nature10842]
How do we figure out these trees? Times?

Dangers of Paralogs
Sequence homology is caused by:
• Orthologs - speciation
• Paralogs - duplication
• Xenologs - horizontal transfer (e.g., by virus)
Consequence for tree building:
• Right species distance: (1,(2,3))
• If we only consider 1A, 2B, and 3A: ((1,3),2)
[Figure: a gene duplication followed by speciation events; leaves 1A 2A 3A 3B 2B 1B]

Type of Data
• Distance-based
  – Input: matrix of distances between species
  – Distance can be
    • the fraction of residues they disagree on,
    • the alignment score between them,
    • …
• Character-based
  – Examine each character (e.g., residue) separately

Distance Based Methods

Tree based distances
• d(i,j) = sum of arc lengths on the path from i to j
• Given d, can we find
  – an exactly matching tree?
  – an approximately matching tree?

The Problem
Input: matrix d of distances between species
Goal: Find a tree with leaves = species and edge distances, matching d best.
Quality measure: sum of squares:
  $SSQ(T) = \sum_{i} \sum_{j \neq i} w_{ij} (d_{ij} - t_{ij})^2$
• $t_{ij}$: distance in the tree
• $w_{ij}$: pair weighting. Options: (1) $w_{ij} \equiv 1$  (2) $w_{ij} = 1/d_{ij}$  (3) $w_{ij} = 1/d_{ij}^2$
NP-hard (Day '86). We'll describe common heuristics.

UPGMA Clustering (Sokal & Michener 1958)
(Unweighted pair-group method with arithmetic mean)
• Approach: Form a tree; species that are closer according to the input distances should be closer in the tree
• Build the tree bottom up, each time merging two smaller trees
• All leaves are at the same distance from the root

Efficiency lemma
• Approach: gradually form clusters: sets of species
• Repeatedly identify two clusters and merge them.
• For clusters Ci, Cj, define the distance between them to be the average distance between their members:
  $d(C_i, C_j) = \frac{1}{|C_i||C_j|} \sum_{p \in C_i} \sum_{q \in C_j} d(p,q)$
• Lemma: If Ck is formed by merging Ci and Cj, then for every other cluster Cl
  d(Ck,Cl) = (|Ci|*d(Ci,Cl) + |Cj|*d(Cj,Cl)) / (|Ci| + |Cj|)
Hence the distances between clusters can be updated in time proportional to the number of clusters.

UPGMA algorithm
Initialize:
• Each node is a cluster Ci = {i}
• d(Ci,Cj) = d(i,j)
• Set height(i) = 0 ∀i
Iterate:
• Find Ci, Cj with smallest d(Ci,Cj)
• Introduce a new cluster node Ck that replaces Ci and Cj
  // Ck represents all the leaves in clusters Ci and Cj
• Introduce a new tree node Aij with height(Aij) = d(Ci,Cj)/2
  // d(Ci,Cj) is the average distance among leaves of Ci and Cj
• Connect the corresponding tree nodes Ci, Cj to Aij with
  length(Ci,Aij) = height(Aij) - height(Ci)
  length(Cj,Aij) = height(Aij) - height(Cj)
• For all other Cl: d(Ck,Cl) = (|Ci|*d(Ci,Cl) + |Cj|*d(Cj,Cl)) / (|Ci| + |Cj|)
  // the distance to any old cluster is the average distance between its leaves and the leaves in Ci, Cj
Time: Naïve: O(n^3); can show O(n^2 log n) (exercise) and O(n^2) (harder exercise)
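
The UPGMA iteration above can be sketched in Python as follows. This is a minimal illustration of the naïve O(n^3) variant, not the lecturers' code: the function name `upgma`, the representation of distances as a dict keyed by `frozenset` pairs, and the use of nested tuples as cluster labels are all assumptions made for the example.

```python
from itertools import combinations

def upgma(names, d):
    """Naive UPGMA sketch.

    names: list of leaf labels (hypothetical input format).
    d: dict mapping frozenset({a, b}) -> distance between leaves a and b.
    Returns (root, height, children), where children[k] = (i, j) for each
    merged node k and height[k] is its height above the leaves.
    """
    dist = dict(d)                      # working copy of cluster distances
    size = {n: 1 for n in names}        # |Ci|
    height = {n: 0.0 for n in names}    # leaves have height 0
    children = {}
    clusters = list(names)
    while len(clusters) > 1:
        # Find Ci, Cj with the smallest d(Ci, Cj)
        i, j = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        k = (i, j)                      # label of the new cluster / tree node
        height[k] = dist[frozenset((i, j))] / 2
        children[k] = (i, j)
        size[k] = size[i] + size[j]
        # Efficiency lemma: size-weighted average of the old distances
        for l in clusters:
            if l != i and l != j:
                dist[frozenset((k, l))] = (
                    size[i] * dist[frozenset((i, l))]
                    + size[j] * dist[frozenset((j, l))]
                ) / size[k]
        clusters = [c for c in clusters if c != i and c != j] + [k]
    return clusters[0], height, children

# Small ultrametric example (made-up distances): A and B are closest.
example = {frozenset(("A", "B")): 2, frozenset(("A", "C")): 4, frozenset(("B", "C")): 4}
root, height, children = upgma(["A", "B", "C"], example)
```

The branch length from a child to its parent is `height[parent] - height[child]`, as on the slide. Scanning all pairs in every round is what makes this version cubic; the faster variants mentioned above need a smarter way to find the minimum.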

UPGMA alg (2)
• Orange nodes represent the groups of nodes that they replaced, and maintain the average distance of the set from other leaf nodes/clusters
[Figure: tree nodes A12, A45, A123, A12345 over leaves 1-5, with heights h12, h123 and the corresponding clusters C12, C45, C123]

UPGMA Algorithm
[Figure sequence over five slides: five points (1-5) merged step by step into a tree]

Molecular Clock
• UPGMA implicitly assumes the tree is ultrametric: equal leaf-root distances => common uniform clock
• Works reasonably well for nearby species
[Figure: two trees on leaves 1-4]

Additivity
• A weaker additivity assumption: distances between species are the sum of distances between intermediate nodes (even if the tree is not ultrametric)
  d(i,j) = a + b
  d(i,k) = a + c
  d(j,k) = b + c
[Figure: leaves i, j, k attached to a common internal node by edges of length a, b, c]

Consequences of Additivity
• Suppose input distances are additive
• For any three leaves i, j, k meeting at internal node m:
  d(i,j) = a + b
  d(i,k) = a + c
  d(j,k) = b + c
• Thus
  $d(m,k) = \frac{1}{2}\left(d(i,k) + d(j,k) - d(i,j)\right)$

Consequences of Additivity II
• Suppose input distances are additive
• For any four leaves i, j, k, l
• Induced tree weight: W
  d(i,j) + d(k,l) = W + b
  d(i,l) + d(j,k) = W + b
  d(i,k) + d(j,l) = W − b
• Largest sum must appear twice
[Figure: quartet tree on leaves i, j, k, l with edges a, b, c, d, e, where b is the internal edge]

Consequences of Additivity III
• If we can identify neighbor leaves, then we can use pairwise distances to reconstruct the tree:
• Remove neighbors i, j from the leaf set
• Add k
• Set d_km = (d_im + d_jm − d_ij)/2
But can we find neighbor leaves?
[Figure: neighbors i and j joined at node k, with another leaf m]

Closest pairs may not be neighbors!
• Closest pair: k, j, even though they are not neighbors
[Figure: quartet on leaves k, j, i, l with edge lengths 1, 1, 1, 4, 4]

Neighbor Joining (Saitou-Nei '87)
• Let D(i,j) = d(i,j) − (r_i + r_j), where
  $r_i = \frac{1}{|L|-2} \sum_{k} d(i,k)$
  is the "corrected" average distance of i from all other nodes
Theorem: if D(i,j) is minimum among all pairs of leaves, then i and j are neighbors in the tree

Neighbor Joining
  $r_i = \frac{1}{|L|-2} \sum_{k} d(i,k)$,  D(i,j) = d(i,j) − (r_i + r_j)
• Set L to contain all leaves
Iteration:
• Choose i, j such that D(i,j) is minimum
• Create new node k, and set
  d(i,k) = (d(i,j) + r_i − r_j) / 2
  d(j,k) = (d(i,j) + r_j − r_i) / 2
  d(k,m) = (d(i,m) + d(j,m) − d(i,j)) / 2
• Remove i, j from L, and add k
• Update r, D
• Termination: when |L| = 2, connect the two nodes
Time: O(n^3)
Thm: Optimal tree guaranteed if distances match a tree
Does not assume a clock

Neighbor Joining Algorithm (example; Center for Biological Sequence Analysis)
Input distances:
      A    B    C    D
A     -   17   21   27
B          -   12   18
C               -   14
D                    -

i   u_i
A   (17+21+27)/2 = 32.5
B   (17+12+18)/2 = 23.5
C   (21+12+14)/2 = 23.5
D   (27+18+14)/2 = 29.5

Corrected distances D_ij − u_i − u_j (in this example D_ij denotes the input distance and u_i plays the role of r_i):
      A    B    C    D
A     -  −39  −35  −35
B          -  −35  −35
C               -  −39
D                    -

[Figure: C and D joined at a new node X]
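
The Neighbor Joining iteration above can be sketched in Python. This is a hedged illustration only: the names `neighbor_joining` and `tree_distance`, the `frozenset`-keyed distance dict, and the internal node labels `N0`, `N1`, … are assumptions made for the example (leaf labels are assumed not to clash with them), and ties in the minimum are broken by whichever pair comes first.

```python
from itertools import combinations

def neighbor_joining(names, d):
    """Sketch of Neighbor Joining on at least two leaves.

    names: list of leaf labels; d: dict frozenset({a, b}) -> distance.
    Returns a list of (node, node, length) edges of an unrooted tree.
    """
    d = dict(d)
    L = list(names)
    edges = []
    next_id = 0
    while len(L) > 2:
        # r_i: sum of distances from i, divided by |L| - 2
        r = {i: sum(d[frozenset((i, k))] for k in L if k != i) / (len(L) - 2)
             for i in L}
        # Choose i, j minimizing D(i,j) = d(i,j) - (r_i + r_j)
        i, j = min(combinations(L, 2),
                   key=lambda p: d[frozenset(p)] - r[p[0]] - r[p[1]])
        k = f"N{next_id}"               # hypothetical label for the new node
        next_id += 1
        dij = d[frozenset((i, j))]
        edges.append((i, k, (dij + r[i] - r[j]) / 2))
        edges.append((j, k, (dij + r[j] - r[i]) / 2))
        for m in L:
            if m != i and m != j:
                d[frozenset((k, m))] = (
                    d[frozenset((i, m))] + d[frozenset((j, m))] - dij) / 2
        L = [m for m in L if m != i and m != j] + [k]
    a, b = L                            # termination: connect the last two nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges

def tree_distance(edges, a, b):
    """Path length between nodes a and b in the tree given by edges."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    stack = [(a, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == b:
            return dist
        stack.extend((n, node, dist + w) for n, w in adj[node] if n != parent)
    raise ValueError("nodes are not connected")

# The four-taxon distance matrix from the example slides.
raw = {("A", "B"): 17, ("A", "C"): 21, ("A", "D"): 27,
       ("B", "C"): 12, ("B", "D"): 18, ("C", "D"): 14}
edges = neighbor_joining(["A", "B", "C", "D"],
                         {frozenset(p): v for p, v in raw.items()})
```

On this matrix the pairs (A,B) and (C,D) tie at −39, so the sketch may join A and B first where the slides join C and D. Because these distances are additive, either choice should give the same unrooted tree, which can be checked by comparing `tree_distance` on every leaf pair with the input matrix.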
