Computational Genomics
Prof. Ron Shamir & Prof. Roded Sharan
School of Computer Science, Tel Aviv University

Lecture 10: Phylogeny
25,27/12/12

Slides:
• Adi Akavia
• ’s slides at HUJI (based on ALGMB 98)
• Anders Gorm Pedersen, Technical University of Denmark

Sources: Joe Felsenstein “Inferring Phylogenies” (2004)

CG © Ron Shamir

Phylogeny

• Phylogeny: the ancestral relationship of a set of species.
• Represented by a phylogenetic tree.

[Figure: a phylogenetic tree with a branch, a leaf and an internal node marked]

• Leaves - contemporary
• Internal nodes - ancestral
• Branch length - distance between sequences

“Classical” Phylogeny schools

Classical vs. Modern:
• Classical - morphological characters
• Modern - molecular sequences

Trees and Models

• rooted / unrooted
• topology / distance
• binary / general
• “molecular clock”

To root or not to root?

• Unrooted tree: phylogeny without direction.

Rooting an Unrooted Tree

• We can estimate the position of the root by introducing an outgroup:
  – a species that is definitely most distant from all the species of interest

[Figure: tree over Aardvark, Bison, Chimp, Dog and Elephant with Falcon as the outgroup, showing the proposed root]

[Figure: A. Scally et al., Nature 483, 169-175 (2012), doi:10.1038/nature10842]

How do we figure out these trees? Times?

Dangers of Paralogs

Sequence homology caused by:
• Orthologs - speciation
• Paralogs - duplication
• Xenologs - horizontal transfer (e.g., by virus)

• Right species distance: (1,(2,3))
• If we only consider 1A, 2B, and 3A: ((1,3),2)

[Figure: gene tree with a gene duplication followed by speciation events; leaves in order 1A 2A 3A 3B 2B 1B]

Type of Data

• Distance-based
  – Input: matrix of distances between species
  – Distance can be
    • the fraction of residues on which they disagree,
    • the alignment score between them,
    • …
• Character-based
  – Examine each character (e.g., residue) separately

Distance Based Methods

Tree based distances

• d(i,j) = sum of arc lengths on the path from i to j
• Given d, can we find
  – an exactly matching tree?
  – an approximately matching tree?

The Problem

Input: matrix d of distances between species
Goal: Find a tree with leaves = species and edge distances, matching d best.
Quality measure: sum of squares:

$$SSQ(T) = \sum_i \sum_{j \ne i} w_{ij}\,(d_{ij} - t_{ij})^2$$

• $t_{ij}$: distance in the tree
• $w_{ij}$: pair weighting. Options: (1) $w_{ij} \equiv 1$ (2) $w_{ij} = 1/d_{ij}$ (3) $w_{ij} = 1/d_{ij}^2$

NP-hard (Day ’86). We’ll describe common heuristics.

UPGMA Clustering (Sokal & Michener 1958)
(Unweighted pair-group method with arithmetic mean)

• Approach: form a tree; species that are closer according to the input distances should be closer in the tree
• Build the tree bottom up, each time merging two smaller trees
• All leaves are at the same distance from the root

Efficiency lemma

• Approach: gradually form clusters: sets of species
• Repeatedly identify two clusters and merge them.
• For clusters $C_i$, $C_j$, define the distance between them to be the average distance between their members:

$$d(C_i, C_j) = \frac{1}{|C_i||C_j|} \sum_{p \in C_i} \sum_{q \in C_j} d(p,q)$$

• Lemma: If $C_k$ is formed by merging $C_i$ and $C_j$ then for every other cluster $C_l$

$$d(C_k, C_l) = \frac{|C_i|\,d(C_i,C_l) + |C_j|\,d(C_j,C_l)}{|C_i| + |C_j|}$$

⇒ Can update distances between clusters in time proportional to the number of clusters.

UPGMA algorithm

Initialize:
• Each node is a cluster Ci = {i}, with d(Ci,Cj) = d(i,j)
• Set height(i) = 0 ∀i

Iterate:
• Find Ci, Cj with smallest d(Ci,Cj)
• Introduce a new cluster node Ck that replaces Ci and Cj
  // Ck represents all the leaves in clusters Ci and Cj
• Introduce a new tree node Aij with height(Aij) = d(Ci,Cj)/2
  // d(Ci,Cj) is the average dist among leaves of Ci and Cj
• Connect the corresponding tree nodes Ci, Cj to Aij with
  length(Ci,Aij) = height(Aij) - height(Ci)
  length(Cj,Aij) = height(Aij) - height(Cj)
• For all other Cl:
  d(Ck,Cl) = (|Ci|*d(Ci,Cl) + |Cj|*d(Cj,Cl)) / (|Ci| + |Cj|)
  // dist to any old cluster is the average dist between its leaves and the leaves in Ci, Cj

Time: Naïve: O(n^3); can show O(n^2 log n) (ex.) and O(n^2) (harder ex.)

UPGMA alg (2)

• Orange nodes represent the groups of nodes that they replaced, and maintain the average dist of the set from other leaf nodes/clusters

[Figure: UPGMA tree over leaves 1-5 with tree nodes A12, A45, A123, A12345 and cluster nodes C12, C45, C123]

UPGMA Algorithm

[Figure: five animation frames of UPGMA merging the points 1-5 and building the tree with leaf order 1 2 3 5 4]
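The UPGMA iteration described above can be sketched in a few lines of Python. This is a minimal illustration of the naïve O(n^3) variant, not the lecture's implementation: the function name `upgma`, the input format (a dict of pairwise distances) and the nested-tuple tree representation are assumptions made for the example.

```python
from itertools import combinations


def upgma(labels, d):
    """Build a UPGMA tree (naive O(n^3) version).

    labels: list of leaf names (hypothetical input format).
    d: dict mapping a pair (a, b) of leaf names to their distance,
       one entry per unordered pair.
    Returns nested tuples (left_subtree, right_subtree, height);
    a leaf is represented by its name and has height 0.
    """
    dist = {frozenset(pair): float(v) for pair, v in d.items()}
    # Active clusters: id -> (subtree, number of leaves).
    clusters = {name: (name, 1) for name in labels}
    next_id = 0
    while len(clusters) > 1:
        # Find the two closest clusters Ci, Cj.
        ci, cj = min(combinations(clusters, 2),
                     key=lambda p: dist[frozenset(p)])
        (ti, ni), (tj, nj) = clusters.pop(ci), clusters.pop(cj)
        height = dist[frozenset((ci, cj))] / 2.0
        ck = ("cluster", next_id)  # id of the merged cluster Ck
        next_id += 1
        # Efficiency lemma: size-weighted average of the old distances.
        for cl in clusters:
            dist[frozenset((ck, cl))] = (
                ni * dist[frozenset((ci, cl))]
                + nj * dist[frozenset((cj, cl))]
            ) / (ni + nj)
        clusters[ck] = ((ti, tj, height), ni + nj)
    (tree, _size), = clusters.values()
    return tree


# Small ultrametric example (made-up distances).
d = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
     ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
tree = upgma(["A", "B", "C", "D"], d)
```

Branch lengths are not stored explicitly; they follow from the heights as height(parent) - height(child), exactly as in the algorithm above.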

Molecular Clock

• UPGMA implicitly assumes the tree is ultrametric: equal leaf-root distances => common uniform clock

[Figure: two example trees over leaves 1-4]

• Works reasonably well for nearby species

Additivity

• A weaker additivity assumption: distances between species are the sum of distances between intermediate nodes (even if the tree is not ultrametric)

For three leaves i, j, k joined at an internal node m by branches of lengths a, b, c respectively:

$$d(i,j) = a + b, \qquad d(i,k) = a + c, \qquad d(j,k) = b + c$$

Consequences of Additivity

• Suppose input distances are additive
• For any three leaves i, j, k (joined at internal node m as above):

$$d(i,j) = a + b, \qquad d(i,k) = a + c, \qquad d(j,k) = b + c$$

• Thus

$$d(m,k) = \tfrac{1}{2}\big(d(i,k) + d(j,k) - d(i,j)\big)$$

Consequences of Additivity II

• Suppose input distances are additive
• For any four leaves i, j, k, l, let W be the weight of the induced tree and b the length of its internal branch:

$$d(i,j) + d(k,l) = W + b$$
$$d(i,l) + d(j,k) = W + b$$
$$d(i,k) + d(j,l) = W - b$$

• Largest sum must appear twice
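The statement that the largest of the three pairwise sums must appear twice is easy to test numerically. The sketch below checks this necessary condition on every quartet of a distance matrix; the function names and the list-of-lists matrix format are assumptions made for illustration, and the example distances are arbitrary.

```python
from itertools import combinations


def quartet_ok(d, i, j, k, l, tol=1e-9):
    """True if the two largest of the three pair sums coincide."""
    sums = sorted([d[i][j] + d[k][l],
                   d[i][k] + d[j][l],
                   d[i][l] + d[j][k]])
    return abs(sums[2] - sums[1]) <= tol


def satisfies_four_point(d, tol=1e-9):
    """Check the condition on every quartet of indices of the matrix d."""
    n = len(d)
    return all(quartet_ok(d, *q, tol=tol)
               for q in combinations(range(n), 4))


# Symmetric matrix over four species (order A, B, C, D).
additive = [[0, 17, 21, 27],
            [17, 0, 12, 18],
            [21, 12, 0, 14],
            [27, 18, 14, 0]]

# Same matrix with d(A,D) perturbed from 27 to 30.
perturbed = [row[:] for row in additive]
perturbed[0][3] = perturbed[3][0] = 30
```

For the first matrix the three sums are 31, 39 and 39, so the condition holds; after the perturbation they are 31, 39 and 42, so no tree can match the distances exactly.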

Consequences of Additivity III

• If we can identify neighbor leaves, then we can use pairwise distances to reconstruct the tree:
  • Remove neighbors i, j from the leaf set
  • Add their common parent k as a new leaf
  • Set d(k,m) = (d(i,m) + d(j,m) - d(i,j))/2 for every other leaf m

But can we find neighbor leaves?

Closest pairs may not be neighbors!

• Closest pair: k, j, even though they are not neighbors

[Figure: four-leaf tree with neighbor pairs (i,k) and (j,l); the branches to k and j and the internal branch have length 1, the branches to i and l have length 4, so d(k,j) = 3 while d(i,k) = 5]

Neighbor Joining (Saitou-Nei ’87)

• Let $D(i,j) = d(i,j) - (r_i + r_j)$, where

$$r_i = \frac{1}{|L| - 2} \sum_{k} d(i,k)$$

is the “corrected” average distance of i from all other nodes.

Theorem: if D(i,j) is minimum among all pairs of leaves, then i and j are neighbors in the tree

Neighbor Joining

• Set L to contain all leaves
Iteration:
• Choose i, j such that D(i,j) is minimum
• Create new node k, and set
  d(i,k) = (d(i,j) + r_i - r_j) / 2
  d(j,k) = (d(i,j) + r_j - r_i) / 2
  d(k,m) = (d(i,m) + d(j,m) - d(i,j)) / 2 for every other node m
• Remove i, j from L, and add k
• Update r, D
Termination: when |L| = 2, connect the two nodes

Time: O(n^3)
Thm: Optimal tree guaranteed if distances match a tree
Does not assume a clock

Neighbor Joining Algorithm: worked example (slides from the Center for Biological Sequence Analysis)

Distance matrix D:

      A    B    C    D
A     -
B    17    -
C    21   12    -
D    27   18   14    -

Compute u_i = Σ_k D_ik / (n - 2), with n = 4:

u_A = (17 + 21 + 27)/2 = 32.5
u_B = (17 + 12 + 18)/2 = 23.5
u_C = (21 + 12 + 14)/2 = 23.5
u_D = (27 + 18 + 14)/2 = 29.5

Compute D_ij - u_i - u_j:

      A    B    C    D
A     -
B   -39    -
C   -35  -35    -
D   -35  -35  -39    -

The smallest value is -39, attained by C, D (tied with A, B). Join C and D into a new node X with branch lengths

v_C = 0.5 × 14 + 0.5 × (23.5 - 29.5) = 4
v_D = 0.5 × 14 + 0.5 × (29.5 - 23.5) = 10

Distances from X to the remaining nodes:

D_XA = (D_CA + D_DA - D_CD)/2 = (21 + 27 - 14)/2 = 17
D_XB = (D_CB + D_DB - D_CD)/2 = (12 + 18 - 14)/2 = 8

Reduced distance matrix:

      A    B    X
A     -
B    17    -
X    17    8    -

Second iteration, with n = 3:

u_A = (17 + 17)/1 = 34
u_B = (17 + 8)/1 = 25
u_X = (17 + 8)/1 = 25

With three nodes left, D_ij - u_i - u_j equals -42 for every pair. Join A and B into a new node Y with branch lengths

v_A = 0.5 × 17 + 0.5 × (34 - 25) = 13
v_B = 0.5 × 17 + 0.5 × (25 - 34) = 4

D_YX = (D_AX + D_BX - D_AB)/2 = (17 + 8 - 17)/2 = 4

Only X and Y remain; connect them with a branch of length 4.

Resulting tree (branch lengths): A-Y = 13, B-Y = 4, Y-X = 4, X-C = 4, X-D = 10. These path lengths reproduce the original distance matrix exactly.
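The Neighbor Joining iteration can be sketched as follows. This is a minimal O(n^3) illustration under assumed conventions, not a reference implementation: the function name `neighbor_joining`, the dict-of-pairs input and the internal node names `N0`, `N1`, ... are all hypothetical. Ties in the selection criterion are broken by taking the first minimal pair, so on the A, B, C, D example it joins A, B first rather than C, D; the resulting unrooted tree and branch lengths are the same.

```python
from itertools import combinations


def neighbor_joining(labels, d):
    """Return the tree as a list of edges (node, node, length).

    labels: list of leaf names; d: dict mapping a pair (a, b) of leaf
    names to their distance, one entry per unordered pair.
    """
    dist = {frozenset(pair): float(v) for pair, v in d.items()}
    nodes = list(labels)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # r_i: "corrected" average distance of i from all other nodes.
        r = {i: sum(dist[frozenset((i, k))] for k in nodes if k != i) / (n - 2)
             for i in nodes}
        # Pick the pair minimizing D(i,j) = d(i,j) - (r_i + r_j).
        i, j = min(combinations(nodes, 2),
                   key=lambda p: dist[frozenset(p)] - r[p[0]] - r[p[1]])
        dij = dist[frozenset((i, j))]
        k = f"N{next_id}"  # new internal node (hypothetical naming)
        next_id += 1
        edges.append((i, k, (dij + r[i] - r[j]) / 2))
        edges.append((j, k, (dij + r[j] - r[i]) / 2))
        nodes = [m for m in nodes if m not in (i, j)]
        for m in nodes:
            dist[frozenset((k, m))] = (
                dist[frozenset((i, m))] + dist[frozenset((j, m))] - dij
            ) / 2
        nodes.append(k)
    a, b = nodes
    edges.append((a, b, dist[frozenset((a, b))]))
    return edges


example = {("A", "B"): 17, ("A", "C"): 21, ("B", "C"): 12,
           ("A", "D"): 27, ("B", "D"): 18, ("C", "D"): 14}
edges = neighbor_joining(["A", "B", "C", "D"], example)
```

On this input the leaf branches come out as A: 13, B: 4, C: 4, D: 10, with an internal branch of length 4.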

Naruya Saitou

• Division of Population Genetics, National Institute of Genetics & Department of Genetics, Graduate University for Advanced Studies (Sokendai), Mishima, 411-8540, Japan

Probabilistic approaches

Likelihood of a Tree

• Given:
  – n aligned sequences M = X1,…,Xn
  – A tree T, leaves labeled with X1,…,Xn
• Reconstruction t:
  – labeling of internal nodes
  – branch lengths
• Goal: Find an optimal reconstruction t*: one maximizing the likelihood P(M|T,t*)

Likelihood (2)

• We need a model for computing P(M|T,t*)
• Assumptions:
  – Each character is independent
  – The branching is a Markov process:
    • The probability that a node x has a specific label is only a function of the parent node y and the branch length t between them.
    • The probabilities P(x|y,t) are known

Example

[Figure: tree with root x5; x5 has children x4 (branch t4) and x3 (branch t3); x4 has children x1 (branch t1) and x2 (branch t2)]

$$P(x_1,\dots,x_5 \mid T,t^*) = P(x_1 \mid x_4,t_1)\,P(x_2 \mid x_4,t_2)\,P(x_3 \mid x_5,t_3)\,P(x_4 \mid x_5,t_4)\,P(x_5)$$

Calculating the Likelihood

Assume that the branch lengths $t_{uv}$ are known.

$$P(M \mid T,t^*) = \prod_{\text{character } j}\ \sum_{\text{reconstruction } R} P(M_{\cdot j}, R \mid T,t^*) = \prod_{\text{character } j}\ \sum_{\text{reconstruction } R} \Big[ P(\mathrm{root}) \prod_{\text{edge } u \to v} p_{u \to v}(t_{uv}) \Big]$$

The first equality uses the independence of sites; the second uses the Markov property (independence of each branch).

Additional Assumed Properties

Additivity:

$$\sum_b P(a \mid b,t)\,P(b \mid c,s) = P(a \mid c,s+t)$$

Reversibility (symmetry):

$$q_b\,P(a \mid b,t) = q_a\,P(b \mid a,t)$$

• Allows one to freely move the root
• Provable under broad and reasonable assumptions

Jukes-Cantor Model (J-C ’69)

• Assumptions:
  – Each base in the sequence has an equal chance of changing
  – Changes to the other 3 bases occur with equal probability
• Characteristics:
  – Each base appears with equal frequency in DNA
  – The quantity r is the rate of change
  – NOTE: r is not the rate of mutation; it is the rate of substitution of a nucleotide throughout a population.
  – During each infinitesimal time ∆t a substitution occurs with probability 3r∆t

[Figure: substitution diagram in which every rate is a = r]

Jukes-Cantor Model

• $P_A(t)$: prob. of A in the character at time t
• Discrete case: the prob. of a change A→B in one time unit is r
• $P_A(t+1) = (1-3r)\,P_A(t) + r\,(1 - P_A(t))$
• $P_A(t+1) - P_A(t) = -3r\,P_A(t) + r\,(1 - P_A(t))$
• Continuous case:

$$\frac{dP_A(t)}{dt} = -3r\,P_A(t) + r\,(1 - P_A(t)) = -4r\,P_A(t) + r$$

• Solution to the differential equation (with $P_A(0) = 1$):

$$P_A(t) = \tfrac{1}{4}\big(1 + 3e^{-4rt}\big)$$

• Prob. that the nucleotide remains unchanged over t time units:

$$P_{same} = \tfrac{1}{4} + \tfrac{3}{4}e^{-4rt}$$

• Probability of a specific change:

$$P_{A \to B} = \tfrac{1}{4} - \tfrac{1}{4}e^{-4rt}$$

• Probability of change:

$$P_{change} = \tfrac{3}{4} - \tfrac{3}{4}e^{-4rt}$$

• Note: for $t \to \infty$, $P_{change} \to \tfrac{3}{4}$

Other Models

• Kimura’s 2-parameter model:
  – A, G - purines; C, T - pyrimidines
  – Two different rates:
    • purine-purine or pyrimidine-pyrimidine (transitions)
    • purine-pyrimidine or pyrimidine-purine (transversions)

• Felsenstein ’84 and Hasegawa, Kishino & Yano ’85 extend the Kimura model to asymmetric base frequencies.
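The Jukes-Cantor probabilities derived earlier are simple enough to check numerically. The sketch below encodes P_same and the specific-change probability and verifies that each row sums to one and that the additivity property holds; the function names `jc_prob` and `jc_change` and the numeric values of r, s and t are assumptions chosen for illustration.

```python
import math

BASES = "ACGT"


def jc_prob(a, b, r, t):
    """P(b at time t | a at time 0) under Jukes-Cantor with rate r."""
    decay = math.exp(-4.0 * r * t)
    if a == b:
        return 0.25 + 0.75 * decay   # P_same
    return 0.25 - 0.25 * decay       # probability of one specific change


def jc_change(r, t):
    """Probability that the base differs after time t (any change)."""
    return 0.75 - 0.75 * math.exp(-4.0 * r * t)


r, s, t = 0.1, 0.7, 1.3  # arbitrary illustrative values

# Each row of the transition matrix sums to 1.
row_sum = sum(jc_prob("A", b, r, t) for b in BASES)

# Additivity: summing over the intermediate base b over times s then t
# gives the same result as a single step of length s + t.
two_step = sum(jc_prob("A", b, r, s) * jc_prob(b, "G", r, t) for b in BASES)
one_step = jc_prob("A", "G", r, s + t)
```

Because the model is symmetric in the bases, the reversibility condition holds with all base frequencies equal to 1/4.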

Charles Cantor

Boston University
• Professor, Biomedical Engineering
• Professor of Pharmacology, School of Medicine
• Center for Advanced Biotechnology
• Ph.D., Biophysical Chemistry, University of California, Berkeley

Dr. Cantor is currently Chief Scientific Officer at Sequenom, Inc in San Diego, California.

Efficient Likelihood Calculation (Felsenstein ’73)

Use dynamic programming.

Define $S_j(a,v)$ = Pr(subtree rooted in v | $v_j = a$)

Initialization: ∀ leaf v set $S_j(a,v) = 1$ if v is labeled by a, else $S_j(a,v) = 0$

Recursion: traverse the tree in postorder: for each node v with children u and w, for each state x

$$S_j(x,v) = \Big[\sum_y S_j(y,u)\,p_{x \to y}(t_{vu})\Big] \Big[\sum_y S_j(y,w)\,p_{x \to y}(t_{vw})\Big]$$

Final solution:

$$L = \prod_j \Big[\sum_x S_j(x,\mathrm{root})\,P(x)\Big]$$

Complexity: $O(nmk^2)$ for n species, m chars, k states

Finding Optimal Branch lengths

Optimizing the length of a single root-child branch r-v can be done using standard optimization techniques:

$$\log L = \sum_{j=1}^{m} \log \sum_{x,y} p(x)\,S''_j(x,r)\,p_{x \to y}(t_{rv})\,S_j(y,v)$$

• Under the symmetry assumption, each node can be made (temporarily) the root
• To heuristically optimize all the branch lengths: repeatedly optimize one branch at a time
  – No guaranteed convergence, but often works
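The dynamic-programming likelihood computation (the postorder recursion for S_j(x,v) and the final product over characters) can be sketched as below. This is an illustration under stated assumptions rather than the lecture's code: transition probabilities come from the Jukes-Cantor model with rate r = 1, the root prior is uniform, and the tree encoding (a leaf is its sequence string; an internal node is a pair of (child, branch_length) pairs) is a hypothetical choice.

```python
import math

STATES = "ACGT"


def jc_prob(x, y, t, r=1.0):
    """Jukes-Cantor probability of state y after time t, starting at x."""
    decay = math.exp(-4.0 * r * t)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def conditional(node, j):
    """Return {x: S_j(x, node)} for character j, computed in postorder."""
    if isinstance(node, str):  # leaf: its aligned sequence
        return {a: 1.0 if node[j] == a else 0.0 for a in STATES}
    (u, t_u), (w, t_w) = node  # internal node with two children
    s_u, s_w = conditional(u, j), conditional(w, j)
    return {
        x: sum(s_u[y] * jc_prob(x, y, t_u) for y in STATES)
           * sum(s_w[y] * jc_prob(x, y, t_w) for y in STATES)
        for x in STATES
    }


def likelihood(root, m):
    """Product over the m characters of sum_x S_j(x, root) * P(x)."""
    total = 1.0
    for j in range(m):
        s = conditional(root, j)
        total *= sum(0.25 * s[x] for x in STATES)  # uniform prior P(x)
    return total


# Three sequences of length 2 on a small tree (made-up data).
tree = ((("AC", 0.1), ("AG", 0.2)), 0.05), ("TC", 0.3)
value = likelihood(tree, 2)
```

For a two-leaf tree the result can be checked by hand: with reversibility and additivity, the likelihood of leaves A and A on branches t1 and t2 is 1/4 times P_same evaluated at t1 + t2.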

Character Based Methods

Inferring a Phylogenetic Tree

Generic problem: Optimal Phylogenetic Tree
• Input:
  – n species,
  – set of characters,
  – for each species, the state of each of the characters,
  – (parameters)
• Goal: find a fully-labeled phylogenetic tree that best explains the data (maximizes a target function).

Example input:
A: CAGGTA
B: CAGACA
C: CGGGTA
D: TGCACT
E: TGCGTA

Assumptions:
• characters - mutually independent
• species evolve independently

A Simple Example

• Five species, three have ‘C’ and two ‘T’ at a specified position.
• A minimal tree has one evolutionary change:

[Figure: tree with leaves C, C, C, T, T and a single change T ↔ C on one branch]

Inferring a Phylogenetic Tree

Naive Solution - Enumeration:
• No. of non-isomorphic, labeled, binary, unrooted trees containing n leaves: $(2n-5)!! = \prod_{i=3}^{n} (2i-5)$
• Rooted: × (2n-3) ⇒ for n = 20 this is on the order of $10^{21}$ !
• An unrooted binary tree with n leaves has
  – n-2 internal nodes
  – 2n-3 edges
• Pf: induction on n. Each new species adds 2 new edges.

[Figure: adding a 3rd species c to the tree on a, b]

Parsimony

• Goal: explain data with min. no. of evolutionary changes (“mutations”, or mismatches)

• Parsimony: $S(T) \equiv \sum_{\{v,u\} \in E(T)} |\{j : v_j \ne u_j\}|$

• “Small parsimony problem”:
  – Input: leaf sequences + a leaf-labeled tree T
  – Goal: Find ancestral sequences implying minimum no. of changes (most parsimonious)
• “Large parsimony problem”:
  – Input: leaf sequences
  – Goal: Find a most parsimonious tree (topology, leaf labeling and internal seqs.)

Algorithm for the Small Parsimony Problem (Fitch ’71)

• Initialization: scan T in post-order, assign:
  – leaf vertex m: $S_m$ = {state at node m}
  – internal node m with children l, r:
      $S_m = S_l \cup S_r$ if $S_l \cap S_r = \emptyset$   (i)
      $S_m = S_l \cap S_r$ otherwise   (ii)
    cost = no. of case (i) events.
• Solution construction: scan the tree in preorder, choose:
  – for the root, choose $x \in S_{root}$
  – at a node k whose parent m has already been assigned: pick the same state as m if possible; o/w pick arbitrarily from $S_k$
• Time: O(n·k) (k = #states)

Fitch’s Alg

[Figure: example tree with root state T, internal sets {T,C} and {A,T}, and two changes T→C and T→A]
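Fitch's two passes for a single character can be sketched as follows. The tree encoding (a leaf is a one-character state string, an internal node is a pair of subtrees) and the function name `fitch` are assumptions for illustration; where the algorithm allows an arbitrary choice, this sketch takes the alphabetically smallest state so that the output is deterministic.

```python
def fitch(tree):
    """Solve small parsimony for one character.

    tree: a leaf is a one-character state string; an internal node is
    a pair (left, right). Returns (cost, labeled), where labeled
    mirrors the tree with internal nodes written (state, left, right).
    """
    def up(node):
        # Post-order pass: returns (state set, left info, right info, cost).
        if isinstance(node, str):
            return ({node}, None, None, 0)
        left, right = up(node[0]), up(node[1])
        common = left[0] & right[0]
        if common:  # case (ii): intersection, no extra change
            return (common, left, right, left[3] + right[3])
        # case (i): union, one extra change
        return (left[0] | right[0], left, right, left[3] + right[3] + 1)

    def down(info, parent_state):
        # Pre-order pass: keep the parent's state whenever possible.
        states, left, right, _cost = info
        state = parent_state if parent_state in states else min(states)
        if left is None:
            return state
        return (state, down(left, state), down(right, state))

    info = up(tree)
    return info[3], down(info, None)


# Five leaves C, C, T, A, T (a made-up example).
cost, labeled = fitch(((("C", "C"), "T"), ("A", "T")))
```

Here the root set is {T}, the two children of the root get the sets {C,T} and {A,T}, and the reconstruction needs two changes (T→C and T→A).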

Walter Fitch
May 21, 1929 - March 10, 2011

• One of the most influential evolutionary biologists in the world, who established a new scientific field: molecular phylogenetics. He was a member of the National Academy of Sciences, the American Academy of Arts and Sciences and the American Philosophical Society. He co-founded and was the first president of the Society for Molecular Biology and Evolution, which established the annually awarded Fitch Prize. Additionally, he was a founding editor of Molecular Biology and Evolution.
• Fitch was at the University of California, Irvine, until his death, preceded by three years at the University of Southern California and 24 years at the University of Wisconsin–Madison.

Weighted Parsimony

k states S1,…,Sk
Cij = cost of changing from state i to j

Algorithm (Sankoff-Cedergren ‘88):

• Need: $S_k(x)$ - best cost for the subtree rooted at x if the state at x is k
• For leaf x:
  $S_k(x) = 0$ if the state of x is k, ∞ o/w
• Scan T in postorder. At node a with children l, r (the parent state k changes to the child state m):

$$S_k(a) = \min_m \big(S_m(l) + C_{km}\big) + \min_m \big(S_m(r) + C_{km}\big)$$

• Opt = $\min_m S_m(\mathrm{root})$
• Time: $O(n \cdot k^2)$
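The weighted-parsimony recursion can be sketched for a single character as follows. The tree encoding (a leaf is its state, an internal node is a pair of subtrees), the function name `sankoff`, and the cost matrices in the example (unit costs, and a transition/transversion scheme with costs 1 and 2) are assumptions made for illustration. `cost[k][m]` is read as the cost of changing from parent state k to child state m.

```python
import math


def sankoff(tree, states, cost):
    """Minimum weighted-parsimony cost of one character on a tree.

    tree: a leaf is a state; an internal node is a pair (left, right).
    cost[k][m]: cost of changing from state k (parent) to m (child).
    """
    def up(node):
        if not isinstance(node, tuple):  # leaf
            return {k: 0.0 if k == node else math.inf for k in states}
        s_l, s_r = up(node[0]), up(node[1])
        return {
            k: min(s_l[m] + cost[k][m] for m in states)
               + min(s_r[m] + cost[k][m] for m in states)
            for k in states
        }

    return min(up(tree).values())


STATES = "ACGT"
PURINES = {"A", "G"}

# Unit costs: every change costs 1 (ordinary parsimony).
unit = {a: {b: 0 if a == b else 1 for b in STATES} for a in STATES}

# Transitions cost 1, transversions cost 2 (illustrative weights).
weighted = {
    a: {b: 0 if a == b else (1 if (a in PURINES) == (b in PURINES) else 2)
        for b in STATES}
    for a in STATES
}

tree = (("A", "G"), ("C", "T"))
unit_cost = sankoff(tree, STATES, unit)
weighted_cost = sankoff(tree, STATES, weighted)
```

With unit costs the result coincides with the unweighted parsimony score of the same tree (3 here); with the weighted matrix the crossing from purines to pyrimidines costs 2, giving 4.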


David Sankoff

Over the past 30 years, Sankoff formulated and contributed to many of the fundamental problems in the field. In sequence comparison, he introduced the quadratic version of the Needleman-Wunsch algorithm, developed the first statistical test for alignments, initiated the study of the limit behavior of random sequences with Vaclav Chvatal and described the multiple alignment problem, based on minimum evolution over a phylogenetic tree. In the study of RNA secondary structure, he developed algorithms based on general energy functions for multiple loops and for simultaneous folding and alignment, and performed the earliest studies of parametric folding and automated phylogenetic filtering. Sankoff and Robert Cedergren collaborated on the first studies of the evolution of the genetic code based on tRNA sequences.

His contributions to phylogenetics include early models for horizontal transfer, a general approach for optimizing the nodes of a given tree, a method for rapid bootstrap calculations, a generalization of the nearest neighbor interchange heuristic, various constraint, consensus and supertree problems, the computational complexity of several phylogeny problems with William Day, and a general technique for phylogenetic invariants with Vincent Ferretti.

Over the last fifteen years he has focused on the evolution of genomes as the result of chromosomal rearrangement processes. Here he introduced the computational analysis of genomic edit distances, including parametric versions, the distribution of gene numbers in conserved segments in a random model with Joseph Nadeau, phylogeny based on gene order with Mathieu Blanchette and David Bryant, generalizations to include multi-gene families, including algorithms for analyzing genome duplication and hybridization with Nadia El-Mabrouk, and the statistical analysis of gene clusters with Dannie Durand.

Sankoff is also well known in linguistics for his methods of studying grammatical variation and change in speech communities, the quantification of discourse analysis and production models of bilingual speech.

Large Parsimony Problem

Input: n × m matrix M:
• $M_{ij}$ = state of the j-th character of species i.
• $M_{i\cdot}$ = label of i (all labels are distinct)

Goal: Construct a phylogenetic tree T with n leaves and a label for each node, s.t.
• there is a 1-1 correspondence of leaves and labels
• the cost of the tree is minimum.

• NP-hard

Branch & Bound (Hendy-Penny ‘89)

• Enumerate all unrooted trees with increasing no. of leaves
• Note: cost of a tree with all leaves ≥ cost of a subtree with some leaves pruned (and the same labeling)
  => If the cost of a subtree ≥ the best cost for a full tree so far, then we can prune (ignore) all refinements of the subtree.
• Enumeration & pruning can be done in O(1) time per visited subtree.

Branch swapping

[Figure: two trees over the sub-trees A, B, C, D, before and after a swap]

• Each internal edge defines 4 sub-trees
• Can swap two such non-adjacent sub-trees

Nearest Neighbor Interchanges

• Handles n-labeled trees
• T and T’ are neighbors if one can get T’ by the following operation on T:

[Figure: the three arrangements of the sub-trees S, R, U, V around an internal edge: SV|RU, SU|RV, SR|UV]

• Use the neighborhood structure on the set of solutions (all trees) via hill climbing, annealing, other heuristics...

NN interchange graph

[Figure: Wilson & Cann, Sci Amer 92]

FIN