Lecture 10: Phylogeny 25,27/12/12 Phylogeny

גנומיקה חישובית Computational Genomics פרופ' רון שמיר ופרופ' רודד שרן Prof. Ron Shamir & Prof. Roded Sharan ביה"ס למדעי המחשב אוניברסיטת תל אביב School of Computer Science, Tel Aviv University , Lecture 10: Phylogeny 25,27/12/12 Phylogeny Slides: • Adi Akavia • Nir Friedman’s slides at HUJI (based on ALGMB 98) •Anders Gorm Pedersen,Technical University of Denmark Sources: Joe Felsenstein “Inferring Phylogenies” (2004) 1 CG © Ron Shamir Phylogeny • Phylogeny: the ancestral relationship of a set of species. • Represented by a phylogenetic tree branch ? leaf ? Internal node ? ? ? Leaves - contemporary Internal nodes - ancestral 2 CG Branch© Ron Shamir length - distance between sequences 3 CG © Ron Shamir 4 CG © Ron Shamir 5 CG © Ron Shamir 6 CG © Ron Shamir “classical” Phylogeny schools Classical vs. Modern: • Classical - morphological characters • Modern - molecular sequences. 7 CG © Ron Shamir Trees and Models • rooted / unrooted “molecular • topology / distance clock” • binary / general 8 CG © Ron Shamir To root or not to root? • Unrooted tree: phylogeny without direction. 9 CG © Ron Shamir Rooting an Unrooted Tree • We can estimate the position of the root by introducing an outgroup: – a species that is definitely most distant from all the species of interest Proposed root Falcon Aardvark Bison Chimp Dog Elephant 10 CG © Ron Shamir A Scally et al. Nature 483, 169-175 (2012) doi:10.1038/nature10842 HOW DO WE FIGURE OUT THESE TREES? TIMES? 12 CG © Ron Shamir Dangers of Paralogs • Right species distance: (1,(2,3)) Sequence Homology Caused By: Gene Duplication •Orthologs - speciation, •Paralogs - duplication •Xenologs - horizontal (e.g., by virus) Speciation events 1A 2A 3A 3B 2B 1B 13 CG © Ron Shamir Dangers of Paralogs • Right species distance: (1,(2,3)) • If we only consider 1A, 2B, and 3A: ((1,3),2) Gene Duplication Speciation events 1A 2A 3A 3B 2B 1B 14 CG © Ron Shamir Type of Data • Distance-based – Input: matrix of distances between species – Distance can be • fraction of residue they disagree on, • alignment score between them, • … • Character-based – Examine each character (e.g., residue) separately 15 CG © Ron Shamir Distance Based Methods 16 CG © Ron Shamir Tree based distances • d(i,j)=sum of arc lengths on the path ij • Given d, can we find – an exactly matching tree? – An approximately matching tree? 17 CG © Ron Shamir The Problem Input: matrix d of distances between species Goal: Find a tree with leaves=chars and edge distances, matching d best. Quality measure: sum of squares: 2 SSQ(T ) = ∑∑ wij (dij − tij ) i j≠i tij: distance in the tree 2 wij: pair weighting. Options: (1) ≡1 (2) 1/dij (3) 1/dij 18 CGNP © Ron- hardShamir (Day ’86). We’ll describe common heuristics UPGMA Clustering (Sokal & Michener 1958) (Unweighted pair-group method with arithmetic mean) • Approach: Form a tree; closer species according to input distances should be closer in the tree • Build the tree bottom up, each time merging two smaller trees • All leaves are at same distance from the root 19 CG © Ron Shamir Efficiency lemma • Approach: gradually form clusters: sets of species • Repeatedly identify two clusters and merge them. • For clusters Ci Cj , define the distance between them to be the average dist betw their members: 1 d (Ci ,C j ) = ∑ ∑d (p,q) |Ci ||C j | p∈Ci q∈C j • Lemma: If Ck is formed by merging Ci and Cj then for every other cluster Cl d(Ck,Cl) = (|Ci|*d(Ci,Cl) + |Cj|*d(Cj,Cl)) / (|Ci| + |Cj|) Can update distances between clusters in time prop. to the number of clusters. 20 CG © Ron Shamir UPGMA algorithm Initialize: each node is a cluster Ci={i}. d(Ci,Cj)=d(i,j) set height(i) = 0 ∀i Iterate: • Find Ci,Cj with smallest d(Ci,Cj) • Introduce a new cluster node Ck that replaces Ci and Cj //Ck represents all the leaves in clusters Ci and Cj • Introduce a new tree node Aij with height(Aij) = d(Ci,Cj)/2 // d(Ci,Cj) is the average dist among leaves of Ci and Cj • Connect the corresponding tree nodes Ci,Cj to Aij with length(Ci, Aij) = height(Aij) - height(Ci) length(Cj,Aij) = height(Aij) - height(Cj) • For all other Cl: d(Ck,Cl) = (|Ci|*d(Ci,Cl) + |Cj|*d(Cj,Cl)) / (|Ci| + |Cj|) //dist to any old cluster is the ave dist between its leaves and leaves in Ci, Cj Time: Naïve: O(n3); Can show O(n2 logn) (ex.) and O(n2) (harder ex.) 21 CG © Ron Shamir UPGMA alg (2) • Orange nodes represent the groups of nodes that they replaced, and maintain the average dist of the set from other leaf nodes/clusters A12345 h123 A123 h12 A12 h12 A45 1 2 3 4 5 C 3 C123 C 12 C45 45 1 22 22 CG © Ron Shamir,Shamir ‘09 UPGMA Algorithm 4 3 5 2 1 1 2 3 5 4 CG © Ron Shamir 23 UPGMA Algorithm 4 3 5 2 1 1 2 3 5 4 CG © Ron Shamir 24 UPGMA Algorithm 4 3 5 2 1 1 2 3 5 4 CG © Ron Shamir 25 UPGMA Algorithm 4 3 5 2 1 1 2 3 5 4 CG © Ron Shamir 26 UPGMA Algorithm 4 3 5 2 1 1 2 3 5 4 CG © Ron Shamir 27 Molecular Clock • UPGMA implicitly assumes the tree is ultrametric: equal leaf-root distances => common uniform clock 2 3 2 3 4 1 1 4 • Works reasonably well for nearby species 28 CG © Ron Shamir Additivity • A weaker additivity assumption: distances between species are the sum of distances between intermediate nodes (even if the tree is not ultrametric) k = + c d (i , j ) a b b d (i ,k ) = a + c j a d (j,k ) = b + c i 29 CG © Ron Shamir Consequences of Additivity • Suppose input distances are additive • For any three leaves k d (i , j ) = a + b c b d (i ,k ) = a + c j a m d (j,k ) = b + c i • Thus 1 d (m,k ) = (d (i ,k ) + d (j,k ) − d (i , j )) 30 CG © Ron Shamir 2 Consequences of Additivity II • Suppose input distances are additive • For any four leaves • Induced tree wt: W k j d(i, j) + d(k,l) = W + b c d d(i,l) + d( j,k) = W + b b a d(i,k) + d( j,l) = W − b e l i • Largest sum must appear twice 31 CG © Ron Shamir Consequences of Additivity III • If we can identify neighbor leaves, then can use pairwise distances to reconstruct the tree: • Remove neighbors i, j from the leaf set • Add k i m • Set dkm= (dim + djm –dij)/2 k But can we find neighbor leaves? b 32 CG © Ron Shamir j Closest pairs may not be neighbors! • Closest pair: kj even though they are not neighbors k j 1 1 1 4 4 i l 33 CG © Ron Shamir Neighbor Joining (Saitou-Nei ’87) • Let D(i , j ) = d (i , j ) − (ri + rj ) where 1 = “Corrected” average ri ∑ d(i,k) distance of i from all | L | −2 k other nodes Theorem: if D(i,j) is minimum among all pairs of leaves, then i and j are neighbors in the tree 34 CG © Ron Shamir 1 r = d(i,k) i − ∑ Neighbor Joining | L | 2 k D(i , j ) = d (i , j ) − (ri + rj ) • Set L to contain all leaves Iteration: • Choose i,j such that D(i,j) is minimum • Create new node k, and set i m d(i,k) = (d(i, j) + r − r ) / 2 i j k = + − d( j,k) (d(i, j) rj ri ) / 2 d(k,m) = (d(i,m) + d( j,m) − d(i, j)) / 2 j • remove i,j from L, and add k Time: O(n3) • Update r, D • Termination: when |L| =2 connect the two nodes Thm: Opt tree guaranteed if distances match a tree 35 Does CG © Ron notShamir assume a clock Neighbor Joining Algorithm A B C D A - 17 21 27 B - 12 18 C - 14 D - CENTER FOR BIOLOGICALSEQUENCE ANALYSIS Neighbor Joining Algorithm A B C D i ui A - 17 21 27 A ( 17+21+27)/2=32. 5 B - 12 18 B ( 17+12+18)/2=23. 5 C - 14 C ( 21+12+14)/2=23. 5 D - D ( 27+18+14)/2=29. 5 CENTER FOR BIOLOGICALSEQUENCE ANALYSIS Neighbor Joining Algorithm A B C D i ui A - 17 21 27 A ( 17+21+27)/2=32. 5 B - 12 18 B ( 17+12+18)/2=23. 5 C - 14 C ( 21+12+14)/2=23. 5 D - D ( 27+18+14)/2=29. 5 A B C D A - - 39 - 35 - 35 CENTER FOR BIOLOGICALSEQUENCE ANALYSIS B - - 35 - 35 C - - 39 D - Dij- ui - uj Neighbor Joining Algorithm A B C D i ui A - 17 21 27 A ( 17+21+27)/2=32. 5 B - 12 18 B ( 17+12+18)/2=23. 5 C - 14 C ( 21+12+14)/2=23. 5 D - D ( 27+18+14)/2=29. 5 A B C D A - - 39 - 35 - 35 CENTER FOR BIOLOGICALSEQUENCE ANALYSIS B - - 35 - 35 C - - 39 D - Dij- ui - uj Neighbor Joining Algorithm A B C D i ui A - 17 21 27 A ( 17+21+27)/2=32. 5 B - 12 18 B ( 17+12+18)/2=23. 5 C - 14 C ( 21+12+14)/2=23. 5 D - D ( 27+18+14)/2=29. 5 A B C D A - - 39 - 35 - 35 C D CENTER FOR BIOLOGICALSEQUENCE ANALYSIS B - - 35 - 35 X C - - 39 D - Dij- ui - uj Neighbor Joining Algorithm A B C D i ui A - 17 21 27 A ( 17+21+27)/2=32.

Lecture 10: Phylogeny 25,27/12/12 Phylogeny

Bonnie Berger Named ISCB 2019 ISCB Accomplishments by a Senior

DREAM: a Dialogue on Reverse Engineering Assessment And

Research Report 2006 Max Planck Institute for Molecular Genetics, Berlin Imprint | Research Report 2006

Machine Learning and Statistical Methods for Clustering Single-Cell RNA-Sequencing Data Raphael Petegrosso 1, Zhuliu Li 1 and Rui Kuang 1,∗

Annual Meeting 2016

Personal Data Education Current Research Interests Academic

Igor Ulitsky – CV September 2016

Front Matter

Biclustering Usinig Message Passing

Biclustering Algorithms for Biological Data Analysis: a Survey* Sara C

Reconstructing Cancer Karyotypes from Short Read Data: the Half Empty and Half Full Glass Rami Eitan and Ron Shamir*

Opportunities and Obstacles for Deep Learning in Biology and Medicine