External Review on Center for Computational Sciences

Division of Global Environment and Biological Sciences Biological Science Group

Hashimoto, T. Collaborative member of CCS (Department of Structural Biosciences, Graduate School of Life and Environmental Sciences, Univ. of Tsukuba) Foundation

Division of Global Environment and Biological Sciences Biological Science Group ー Created at Apr. 2004

Hashimoto, T. Collaborative member (Dept. Structural Biosci., Univ. of Tsukuba) Members of Biological Science group

Since Aug. 2005

• Center for Computational Sciences - Ass. Prof. Yuji Inagaki

• Dept. Structural Biosciences - Prof. Tetsuo Hashimoto (Collaborative member) - Res. Assoc. Miako Sakaguchi Research Area

• Molecular Evolution - Global phylogeny of - Methodological studies on molecular phylogenetic inference Beautiful creatures

Rhizaria Chromalveolates Plantae Excavata Opisthokonts

Amoebozoa Main Projects

• Inference on the phylogenetic relationship among large groups of eukaryotes • Inference on the root of the tree of eukaryotes • Inference on the mitochondrial evolution • Inference on the plastid evolution • Methodological studies on molecular phylogenetic inference

• Establishment of the methodology for large scale phylogenetic analysis based on the maximum likelihood (ML) approach Collaboration with other Group in CCS ‘Mol-Evol’ group:

・Biological Science Group Yuji Inagaki / T. Hashimoto

・High Performance Computing Group Prof. Mitsuhisa Sato (Director of CCS) (Graduate students) Yoshihiro Nakajima Akihiro Aida (~Mar. 2007) Collaboration with other Institutions

Univ. of Tsukuba, Bio.-Dept.

Kyoto Univ. Dr. Y. Sako

Osaka Univ. Dr. K. Tanabe Biological Juntendo Univ. Dr. T. Nara Science Group, CCS, JAMSTEC Dr. K. Takishita Tsukuba Univ. Dalhousie Univ. Dr. A. Roger

British Colombia Univ. Dr. P. Keeling MultipleSequence sequenceAlignment and alignment Data for Phylogenetic Analysis

Ribosomal protein mS30

Homo sapiens Schizosaccharomyces pombe Encephalitozoon cuniculi histolytica Plasmodium falciparum Giardia intestinalis Trichomonas vaginalis

Hs Sp Ec Eh Aligned Pf Gi Tv

Hs Sp Ec Selected sites Eh Pf Gi Tv

・Physicochemically similar aa are marked with same color ・Column is called position or site Phylogenetic Tree

A Internal node E A~F:extant species, groups, etc.

→ one of the relationships between

Internal 6 species B branch D → ((A,(B,C)),F,(D,E)) External C branch : tree topology

Branch length: substitution rate F Unit: substitutions/site Maximum Likelihood (ML) method

・Based on explicit evolutionary models, the analysis uses probability to find a tree that best accounts for the variation in a data set f (θ|data) = Pr (data |θ) → maximize = (tree topology, branch lengths, α, π, etc.)

log{ f (θ|data)} (log-likelihood) → maximize

・Calculation is performed on each column of an alignment ・All possible trees are exhaustively examined → Select the ML tree ・Robust against the violation of evolutionary rate constancy between species (sequences) ⇒ applicable for diverse sequences ・Computationally intense Number of Possible Trees

species number of tree topologies

1 1 2 1 31 43 515 6 105 7 945 8 10,395 9 135,135 10 2,027,025 ・・・ n (2n-5)! 2n-3(n-3)! Combined analysis based on Concatenate Model

Gene 1 Gene 2 Gene 3 Species 1 Species 2 Species 3 Species 4

Species 1 Species 2 Species 3 Species 4

Concatenate sequences of different genes ⇒ regard as 1 gene

ML phylogenetic analysis ⇒ a set of parameters:θ^ Combined analysis based on Separate Model

S53 S11 S52 S12 Tree i S51 S13 Gene 1 l (θ^ |X ) 5 groups, 15 species S21 1(i) 1(i) 1 S41 S42 S43 S22 S32 S31 S23 S33 S56 S11 S51 Tree i Gene 2 l (θ^ |X ) 5 groups, 13 species 2(i) 2(i) 2 S22 For each of Tree i S45 S44 S24 S41 S33 (i =1,…15) S31 calculate log-likelihood S35 S34 S32 of the gene k

Gene m l (θ^ |X ) 5 groups, x species m(i) m(i) m +

The log-likelihood of the ML tree in total m m m m ^ ^ k ^ Σ l (θ |X ) max{Σ l (θ |X }=max[Σ{Σlog f (X |θ }] k=1 k(i) k(i) k k=1 k(i) k(i) k k(i) kh k(i) i i k=1 h=1 (i=1,…,15) MPI-Puzzle – For exhaustive tree search

• Calculated lnL by Tree-Puzzle

T lnL Sequential MPI 1 T1

T2 lnLT T lnL 2 1 T1 T lnL

··· 3 ··· T T lnL 3 2 T2 T lnL 3 T3

··· ··· T lnL N+1 TN+1 Data Data T lnL T lnL N+2 TN+2 N TN

··· ··· T lnL ··· N+3 ··· TN+3

T lnL 944 T944 T lnL T lnL 2N+1 T2N+1 945 T945 T lnL 2N+2 T2N+2 T lnL ··· 2N+3 ··· T2N+3 Phylogeny of Eukaryotes – Recent results 23-gene analyses – w/o α-tubulin ~10,000 positions < Separate model > (Exhaustive search for 945 trees)

Opisthokonta ・ Recovered three eukaryotic “supergroups” – Unikont: Opisthokonta + Stramenopiles/ Amoebozoa 41 Alveolates Diplomonads/ – Excavata: Dip/Par + Parabasalids Eug/Het Euglenozoa/ 57 Heterolobosea – Plantae: G+R 70 94 Green plants Red algae Phylogeny of Eukaryotes – Recent results Multi-gene analyses – with ~8,500 positions < Separate model > (Exhaustive search for 10395 trees)

100 Cryptomonad 36 Haptophytes Red algae 75 Guillardia Pavlova 45 Glaucophytes Multiple Gene Phylogenies Green plants Support the Monophyly of Cryptomonad and Haptophyte 98 Alveolates Host Lineages. Stramenopiles Patron, Inagaki, and Keeling: Unikonts Curr. Biol. 2007 Summary of Research Activities 2005-2007 ー Biological Science Group ー

Publications Total Peer reviewed Review (In Jap.) 20 4 24

Domestic Meetings Total Oral Poster Invited 12 9 9 30

International Meetings Total Oral Poster Invited 3 3 2 8 Summary of Research Activities 2005-2007 ー Biological Science Group ー

Grants: Grant-In-Aid for Scientific Research from JSPS 2005-2006 (B) ¥14,200,000 2006-2007 (C) ¥3,600,000 Grant-In-Aid for Young Scientists from JSPS 2006-2007 (B) ¥2,600,000

Awards: Yuji Inagaki 2007 An award from the ministor of MEXT for young scientists Short Term Future Plans

• Phylogenetic relationship among large groups of Eukaryotes - Excavata monophyly? - Chromalveolates monophyly?

• Methodological studies on molecular phylogenetic inference - Imapct of the utilization of different models for combining multiple genes (‘concatenate’ or ‘separate’) Short Term Future Plans

・ Establishment of the methodology for large scale phylogenetic analysis based on the maximum likelihood (ML) approach - Exhaustive search up to 2,027,025 trees for 10 lineages Phylogenetic Analysis

Phylogenetic inference: estimation of : ・Tree topology ・Branch length

Methods of phylogenetic reconstruction:

・Distance matrix method ・Maximum parsimony method ・Maximum likelihood method Phylogenetic Analysis

Phylogenetic inference: estimation of : ・Tree topology ・Branch length

Methods of phylogenetic reconstruction:

・Distance matrix method ・Maximum parsimony method ・Maximum likelihood method Bootstrap analysis

Phylogenetic Resampling inference Pseudo sample 1 → Tree 1 Pseudo sample 2 → Tree 2 Original data set Evaluation of the Pseudo sample 3 → Tree 3 :Sample Sampling error Pseudo sample n → Tree n

Phylogenetic inference Bootstrap sample DM,MP,ML etc. % of the trees that reconstruct a given intermal branch Optimal tree ⇒ Bootstrap (BP) with BP value at each internal branch value Maximum likelihood method in brief

Example data set S1 x1 x 3 S3 (4 species, n positions) t1 i j t3 X = (Xij) (i=1,…4; j=1,…,n) t5 t2 t4 h’th position x S S2 x2 T1 4 4 Xh=(X1h , X2h , X3h , X4h ) S1 ,…, S4 : Extant species x1 ,…, x4 : data of h’th position in extant species i, j : data of h’th position in ancestral species t1 ,…, t5 : branch lengths

Assumption 1. Xt : Continuous-time stationary Markov process with a transition probability Pij(t) transition probability kto l:

Pkl( t) = P{Xt+s = l | Xs = k} Evolution (substitutions) of each branch occurs independently.

Assumption 2.Each position evolves independently. Under the Assumption 1. Using Chapmann-Kolmogorov equation, ← probability of having the data of h’th position, f (x1 , x2, x3 , x4 |θ ) given a tree topology, T1 , and a transition matrix Pij (t)

= ΣΣ P{X0 = i}P{Xt1 = x1 , Xt2 = x2 , Xt5 = j | X0 = i} i j

x P{Xt5+t3 = x3 , Xt5+t4 = x4 | Xt1 = x1 , Xt2 = x2 , Xt5 = j, X0 = i}

= Σ {πi Pix1 (t1 )Pix2 (t2) Σ Pij(t5 ) Pjx3 (t3 ) Pjx4 (t4 )} i j πi : composition of nucleotide or amino acid i

S1 x1 x 3 S3 S ,…, S : Extant species t1 i j t3 1 4 x1 ,…, x4 : data of h’th position in extant species t5 i, j : data of h’th position in ancestral species t2 t4 t1 ,…, t5 : branch lengths

S x x4 S4 2 2 T1 Under the Assumption 2. Likelihood function of θgiven a data X can be written as:

n L (θ | X) =Π f (Xh | θ ) h=1 Its log-likelihood: n l (θ | X) = Σ log f (Xh | θ ) h=1

θ = ( t1, … , t5 ) : branch lengths ^ θ can be estimated by maximizing the log-likelihood function, l l (θ^ | X) = max {l (θ | X) : θ^ ∈ Θ } ^ By comparing the log-likelihood ( l (θ| X) ) values between the different tree

topologies, T1 , T2 , and T3 , a tree with the highest log-likelihood value is selected as the ML tree. Evolutionary model

Parameters

-Tree topology, T branch lengths (t1, …, t5 )

-Transition Probability those included in Pij( t)

-Composition πi -Among site-rate heterogeneity α (shape parameter of Γ distribution)

Pij( t) : for nucleotide substitutions : for amino acid substitutions Jukes-Cantor Poisson Kimura-2-parameter PAM HKY85 JTT TN93 WAG …….. …… Bootstrap resampling

Original data set Cryptosporidium ACTAACTGAG : 4 × 10 matrix Toxoplasma ACTATGAGAG Theileria GCTGTGAAAA Plasmodium GCTGTGATCA 0123456789 Randomly resample 10 positions out of 10 positions in the original data set

Crypto CACTGAAGAA AACGTAATAC GGCTAATGAG ..... Toxo CTGTGAAGTA TACGAAATAC GGGTATAGAG ..... Theil CTGTAAGATG TGCAAGATAC AAGTGTAAAA ..... Plasmo CTGTACGTTG TGCAAGCTCC ATGTGTATCA ..... 1452983740 4019608281 9752346789 Pseudo sample 1 Pseudo sample 2 Pseudo sample 3 Crypto- sporidium Plasmodium ML tree of the 70 original data set Theileria Toxoplasma 0.1substitutions /position

Bootstrap sample 1 Bootstrap sample 2 Bootstrap sample 3 Bootstrap sample 4 Bootstrap sample 5 ML tree ML tree ML tree ML tree ML tree Plasmo Plasmo Toxo Plasmo Plasmo Plasmo Crypto Crypto Crypto Crypto Theil Theil Crypto Theil Theil Theil Toxo Toxo Toxo Toxo

Bootstrap sample 6 Bootstrap sample 7 Bootstrap sample 8 Bootstrap sample 9 Bootstrap sample 10 ML tree ML tree ML tree ML tree ML tree Plasmo Plasmo Plasmo Plasmo Plasmo Crypto Crypto Crypto Crypto Crypto

Theil Theil Theil Theil Theil Toxo Toxo Toxo Toxo Toxo Classical SSUrRNA tree Eukaryota * (~mid 90’s) Enatamoeba

Trypanosoma

Naegleria *

* Trichomonas

Giardia

*Amitochondriate group

Glugea SSUrRNA tree of Eukaryota Presence of large (~mid 90’s) Gruop monophyletic groups Metazoa Fungi Viridiplantae The branching order among Rhodophyta Endosymbiosis of these groups was misleading! proto-mitochondria stramenopiles Alveolata ? Pelobionta* Because, long branch * attraction (LBA) made Eukaryota Heterolobosea the SSUrRNA tree misleading. × Euglenozoa Diplomonadida* Parabasalia* LBA― Most serious problem Microsporidia* in phylogenetic analysis Prokaryota (Outgroup) * Amitochondriate protist group

Long Branch Attraction (LBA) artefact :the most serious problem in phylogenetic analyses

Outgroup Outgroup

Real tree Inferred tree based on the Based on this tree, sequence data simulated sequence data at external nods are simulated. Long branches are artificially located at the base of the tree SSUrRNA tree of Eukaryota (~mid 90’s) (late 90’s) Gruop

Metazoa Metazoa Fungi Fungi Viridiplantae Viridiplantae Lobosa Lobosa Rhodophyta Rhodophyta Endosymbiosis of proto-mitochondria stramenopiles Stramenopiles Alveolata ? Alveolata Pelobionta* ×? Pelobionta* Mycetozoa Mycetozoa Entamoebidae* Entamoebidae* Eukaryota Heterolobosea Heterolobosea × Euglenozoa Euglenozoa Diplomonadida* Diplomonadida* Parabasalia* Parabasalia* Microsporidia* Microsporidia* Prokaryota (Outgroup) * Amitochondriate protist group No consensus on eukaryotic phylogeny !

Combined phylogeny Large group ⇒ Supergroup

~) Metazoa of multi-genes (2000 Fungi Opisthokonta Microsporidia* - To accumulate phylogenetic Lobosea Mycetozoa Amoebozoa signals of individual genes Entamoebidae*

Pelobionta* × Viridiplantae Plantae Rhodophyta Stramenopiles ・ Robust inference Alveolata ・ Euglenozoa ・ Heterolobosea Large groups Diplomonadida* ⇒ Supergroup Parabasalia* Outgroup

* Amitochondriate group