SEARCHING FOR THE ROSETTA STONES IN THE MULTIFUNCTIONAL PROTEINS OF THE PHYTOPHTHORA SOJAE GENOME

Thomas M. Wittenschlaeger II

A Thesis

Submitted to the Graduate College of Bowling Green State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

August 2007

Committee:

Paul F. Morris, Advisor

Scott Rogers

Alexei Fedorov


ABSTRACT

Paul F. Morris, Advisor

Defining protein-protein interactions is a major challenge in newly sequenced genomes. Novel multifunctional proteins in oomycete genomes may be useful in assembling both metabolic and regulatory pathways. A global survey of the oomycete Phytophthora sojae genome identified 274 novel multifunctional proteins using strict criteria that excluded multi-exonic gene models without EST support. These P. sojae proteins have significant BLAST hits to two or more different proteins. Such proteins have been posited as Rosetta stones, since their association in one genome has been used to infer the association of orthologous proteins in other genomes. In our analysis, we adopted the reciprocal smallest distance algorithm (Wall et al. 2003) to identify potential orthologs in 34 sequenced genomes. Surprisingly, this approach identified only ten potential Rosetta stones, where each domain of a multifunctional protein had an ortholog in the same organism. The evolutionary history of oomycetes involved the endosymbiotic acquisition of a red alga and the subsequent transfer of nuclear and plastid genes to the host nucleus. We postulate that this endosymbiotic event (genome acquisition and recombination) has enabled the ancestral genome to develop metabolic and regulatory pathways that are distinct from those of the animal, fungal, and plant genomes. A separate phylogenetic analysis of four metabolic multifunctional proteins suggested in each case that the domains within these multifunctional proteins were more closely related to proteins from different phylogenetic groups. This suggests that such proteins arose from fusion events of proteins acquired by horizontal transfer from the algal endosymbiont or bacterial genomes. Our observations suggest that this evolutionary strategy of genome acquisition and recombination should also be assessed in other members of the Chromalveolates.

ACKNOWLEDGMENTS

I would like to thank my advisor Dr. Paul F. Morris for his support and assistance with the writing of this thesis. Also, without his knowledge of biochemistry and genetics and his preparation of multifunctional proteins, I never would have had the opportunity to work on this project.

I would also like to thank my committee members Dr. Scott Rogers and Dr. Alexei Fedorov.

Thank you to Korinna Straube for her help and skill in the construction of the tables in this work.

Finally, I would like to thank my mom, my grandparents, and my brother for their unwavering support through the years.

TABLE OF CONTENTS

INTRODUCTION

MATERIALS AND METHODS

RESULTS

DISCUSSION

REFERENCES

APPENDIX

LIST OF FIGURES/TABLES

Table 1

Table 2

Figure 1

Figure 2

Figure 3

Figure 4

Figure 5

Figure 6

Figure 7

Figure 8

INTRODUCTION

In the United States, Phytophthora sojae is the second most important pathogen of soybeans, and infection of soybean crops by this organism has an economic impact of millions of dollars per year [1]. Evidence of P. sojae’s destructive power can be seen as bare spots in low-lying areas of soybean fields in the early spring. Other Phytophthora species are major pathogens of a wide range of mostly dicotyledonous plants [2]. A related species, P. ramorum, the causative agent of sudden oak death, threatens the major tree species on the California coastal range. P. infestans was the precipitator of the Irish potato famine and remains the most important worldwide pathogen of potatoes today [3].

Phytophthora species belong to the kingdom Stramenopila, whose organisms belong, in turn, to a larger class of organisms known as the chromalveolates [4]. In addition to Phytophthora, the Stramenopiles include golden brown algae, diatoms, and brown algae. The algal Stramenopiles are photosynthetic, having acquired a plastid in a secondary endosymbiotic event, that is, the acquisition of a red alga by a common ancestor [5]. Phytophthora species are also descended from this common ancestor but, unlike the algal Stramenopiles, P. sojae does not contain a plastid. Instead, it seems to have retained the vestiges of its plant past by incorporating plant genes into its genome while losing the plant’s signature plastid [6].

Analysis of the recently sequenced P. sojae and P. ramorum genomes revealed the presence of several novel multifunctional proteins. Here novel multifunctional proteins are defined as proteins that have two or more domains and are not found in organisms from other phylogenetic groups. Multifunctional proteins may be multimers of repeated domains, fused domains that define metabolic pathways, or components of regulatory pathways. It has been observed that proteins composed of two domains in one organism are often found in other organisms as two separate interacting proteins [7]. Because of this, multifunctional proteins in one organism, termed Rosetta stone proteins, may be useful in predicting protein-protein interactions in other organisms [8]. This can be taken one step further, as the presence of these multifunctional proteins can be used to infer the conservation of pathways between organisms. If a multifunctional protein in one organism has an ortholog in the form of another multifunctional protein or a group of interacting proteins in another organism, then it can be inferred that the organisms share a common pathway. Conversely, the absence of such an ortholog suggests that the organisms do not share a common pathway.

How is it that a single multifunctional protein in one organism can be orthologous to two interacting proteins in another organism? There are two obvious explanations. One is that two interacting proteins fused together under evolutionary pressure to form a more efficient link in some pathway. The second is that a multifunctional protein evolved as a link in some pathway and over time the functional units were released from each other, perhaps to perform some function in other pathways. The first seems the more likely and, should it be true, presents us with an interesting question [9]. Given the evolutionary history of P. sojae, is there evidence that any of the multifunctional proteins arose from a fusion event of an algal gene with a gene from the ancestral protist genome? That is, did evolution force proteins of different phylogenetic origins into a single multifunctional protein?

Here we investigate the evolutionary origin of P. sojae’s pathways and multifunctional proteins. If P. sojae’s pathways are unique, then we should expect that its multifunctional proteins will have no homologs in other organisms, save other Phytophthora species. Further, we test the hypothesis that if P. sojae’s ancestor engulfed a red alga, then its multifunctional proteins are the result of a fusion between the original genes already present in its ancestor and plant genes.

MATERIALS AND METHODS

Protein sequences for P. patens (35938 entries), P. sojae (19824 entries), and P. trichocarpa (45555 entries) were downloaded from the DOE Joint Genome Institute website. Protein sequences for A. fumigatus (9906 entries), A. gossypii (4720 entries), C. briggsae (13192 entries), C. elegans (22499 entries), C. glabrata (5180 entries), C. immitis (10450 entries), C. neoformans (6439 entries), D. hansenii (6311 entries), E. coli (4319 entries), E. cuniculi (1909 entries), F. alni (6711 entries), G. zeae (11638 entries), K. lactis (5326 entries), N. crassa (10076 entries), P. entomophila (5126 entries), P. nodorum (16481 entries), P. yoelii (7755 entries), S. cerevisiae (6199 entries), S. coelicolor (8038 entries), S. pombe (4968 entries), V. eiseniae (4911 entries), and Y. lipolytica (6525 entries) were downloaded from the Swiss-Prot website. Protein sequences for C. tepidum (2245 entries), S. CC9311 (2892 entries), and T. thermophila (27430 entries) were downloaded from the Institute for Genomic Research website. Protein sequences for A. PCC7421 (4911 entries), G. violaceus (4431 entries), and T. elongatus (2477 entries) were downloaded from CyanoBase. Protein sequences for A. gambiae (15135 entries), A. thaliana (33862 entries), B. floridanus (583 entries), B. pennsylvanicus (610 entries), B. rerio (11863 entries), D. aromatica (4155 entries), D. discoideum (13049 entries), D. melanogaster (16282 entries), G. gallus (5315 entries), G. theta (598 entries), H. sapiens (37891 entries), L. major (8010 entries), M. musculus (32849 entries), P. luteolum (2078 entries), P. falciparum (5411 entries), P. tetraurelia (463 entries), R. norvegicus (11812 entries), S. ruber (2812 entries), and T. nigroviridis (27810 entries) were downloaded from the Integr8 website. The M. truncatula (27898 entries) proteome was downloaded from the Medicago truncatula webpage. The C. merolae (5014 entries) proteome was downloaded from the Cyanidioschyzon merolae webpage. The C. hominis (3934 entries) proteome was downloaded from the Cryptosporidium hominis webpage. The O. sativa (62827 entries) proteome was downloaded from the rice annotation project database website.

To be considered a multifunctional candidate, a protein had to meet three criteria: (1) the protein must contain at least two functional domains; (2) any BLAST hits from the protein must not exceed the length of the protein; and (3) the protein’s nucleotide sequence must span a single reading frame, that is, contain no introns. Analysis of protein domains was accomplished using the European Molecular Biology Laboratory's InterProScan software package [10]. This package includes protein signature recognition software and member databases. A standard BLAST procedure was executed to gather data for criterion 2 [11]. Once predicted proteins had been filtered by criteria 1 and 2, the remaining proteins were examined on the JGI genome browser. The browser supplies information about a predicted protein, such as predicted introns in the sequence. This feature was used to eliminate proteins that contained introns. Exceptions were made for proteins whose predicted intron regions seemed to be incorrect, for example, sequence regions that had EST support or intron lengths that were a multiple of three. Computation for identification of multifunctional candidates was performed on the Beowulf cluster at the University of Toronto.
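The three criteria can be expressed as a small filter. The following Python sketch is illustrative only and is not the code used in this study: the `GeneModel` record, its field names, and the reading of criterion 2 as "no BLAST subject protein is longer than the query protein" are assumptions made for the example. In the study itself, the intron check was done by manual inspection on the JGI genome browser.

```python
from dataclasses import dataclass, field


@dataclass
class GeneModel:
    """Hypothetical record for one predicted protein (field names are assumptions)."""
    name: str
    length: int            # protein length in residues
    domains: list          # functional domains, e.g. from InterProScan
    hit_lengths: list      # lengths of the proteins returned as BLAST hits
    intron_lengths: list = field(default_factory=list)  # predicted intron lengths (nt)
    has_est_support: bool = False


def introns_acceptable(model):
    """Criterion 3, with the two exceptions described in the text."""
    if not model.intron_lengths:
        return True                      # single reading frame, no introns
    if model.has_est_support:
        return True                      # exception: model supported by ESTs
    # exception: every predicted intron length is a multiple of three
    return all(n % 3 == 0 for n in model.intron_lengths)


def is_candidate(model):
    """Apply criteria 1 to 3 to one gene model."""
    has_two_domains = len(model.domains) >= 2                       # criterion 1
    hits_fit = all(h <= model.length for h in model.hit_lengths)    # criterion 2 (assumed reading)
    return has_two_domains and hits_fit and introns_acceptable(model)
```

A gene model with two domains, no hit longer than itself, and no introns passes; changing any one of those properties rejects it unless one of the stated exceptions applies.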

Multifunctional candidate sequences were split into halves, each half represented by a functional domain. The reciprocal smallest distance algorithm [12] was used to find orthologs for each half and for the unsplit sequence in thirty-four proteomes. The proteomes used were A. fumigatus, A. gossypii, C. briggsae, C. elegans, C. glabrata, C. neoformans, D. hansenii, E. cuniculi, G. zeae, K. lactis, P. yoelii, S. cerevisiae, S. pombe, Y. lipolytica, A. gambiae, A. thaliana, B. floridanus, B. pennsylvanicus, B. rerio, D. aromatica, D. discoideum, D. melanogaster, G. gallus, G. theta, H. sapiens, L. major, M. musculus, P. luteolum, P. falciparum, P. tetraurelia, R. norvegicus, S. ruber, T. nigroviridis, and O. sativa.
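The reciprocal step at the heart of the reciprocal smallest distance (RSD) approach can be sketched as follows. This is a simplified illustration, not the published implementation: it assumes that pairwise evolutionary distances have already been estimated and stored in a table, and it omits the BLAST pre-filtering and alignment steps that precede distance estimation. All function names and sequence identifiers are hypothetical.

```python
def make_distance(table):
    """Wrap a {(a, b): distance} dict as a symmetric lookup; missing pairs give None."""
    def distance(a, b):
        return table.get((a, b), table.get((b, a)))
    return distance


def smallest_distance_partner(query, candidates, distance):
    """Return the candidate with the smallest known distance to query, or None."""
    best, best_d = None, None
    for c in candidates:
        d = distance(query, c)
        if d is not None and (best_d is None or d < best_d):
            best, best_d = c, d
    return best


def rsd_ortholog(query, genome_i, genome_j, distance):
    """Return the putative ortholog of query (from genome_i) in genome_j, or None.

    The pair is accepted only if the closest sequence in genome_j also has
    the query as its closest sequence in genome_i.
    """
    best_j = smallest_distance_partner(query, genome_j, distance)
    if best_j is None:
        return None
    back = smallest_distance_partner(best_j, genome_i, distance)
    return best_j if back == query else None
```

The reciprocal check is what separates a putative ortholog from a merely similar sequence: a half whose closest match in another proteome is itself closer to a different P. sojae protein is not reported.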

Multifunctional candidate sequences were again split into their functional halves. Homologous sequences for each half were found in twenty-nine different species (A. PCC7421, A. thaliana, C. elegans, C. hominis, C. immitis, C. merolae, C. tepidum, D. melanogaster, E. coli, F. alni, G. violaceus, H. sapiens, M. musculus, M. truncatula, N. crassa, O. sativa, P. entomophila, P. falciparum, P. nodorum, P. patens, P. trichocarpa, S. CC9311, S. cerevisiae, S. coelicolor, T. elongatus, T. nigroviridis, T. thermophila, V. eiseniae) by taking the top five hits from BLAST analysis and aligning those hits with the P. sojae sequence. The twenty-nine species represent seven kingdoms. The top alignments from each species were then aligned in ClustalW with the top alignments from the other species and with the P. sojae half. A phylogenetic tree was created with the Phylip software package by the Protein Sequence Parsimony Method with bootstrap support using 1000 replicates [13].
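Selecting the top five hits for each half can be sketched as below. The example assumes that BLAST results are available in the standard 12-column tab-delimited format (query, subject, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, e-value, bit score) and that hits are ranked by bit score. The ranking criterion and the function name `top_hits` are assumptions for illustration; the thesis does not state how the hits were ranked.

```python
from collections import defaultdict


def top_hits(tabular_lines, n=5):
    """Return {query: [subject, ...]} with the n highest-scoring subjects per query."""
    hits = defaultdict(list)
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        bitscore = float(fields[11])      # 12th column: bit score
        hits[query].append((bitscore, subject))
    return {
        query: [subject for _, subject in sorted(scored, key=lambda t: -t[0])[:n]]
        for query, scored in hits.items()
    }
```

The subjects returned for each half would then be passed, together with the P. sojae half itself, to the multiple alignment step.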

RESULTS

Manual annotation has thus far identified 281 novel multifunctional proteins. An additional 130 predicted novel multifunctional proteins were excluded because they were multiexonic models that are presently without EST support and may represent two separate proteins that were fused into a single protein by the gene prediction programs. The total number of novel multifunctional proteins will also increase as refinements to gene prediction software and additional EST data identify adjacent genes on the genome as single gene models. Based on visual inspection of domain arrangement and order, 47 gene models appear to be multimeric proteins, usually heterodimers, but with a few examples of trimers or larger multimeric proteins. The largest family of homologous multifunctional proteins consists of nine tetrameric calcium-activated potassium channels. Fifty-seven multifunctional proteins are predicted to be involved in metabolic pathways. Thus the majority of novel multifunctional proteins have functional domains that are typically associated with proteins in signaling pathways.

Approximately forty-six percent of the P. sojae multifunctional candidates (132) had no ortholog in other species. Another fifty had an ortholog for only one half of the P. sojae multifunctional protein (Table 1). Ten had orthologs to only the complete sequence. Twenty-two had orthologs to both halves and the unsplit sequence. Seven had orthologs to only both halves of the multifunctional protein sequence. The rest had orthologs to only one half and the complete sequence. Of the twenty-nine proteins that had orthologs to both halves of the multifunctional protein sequence, only ten were suitable for predicting protein-protein interactions (Table 2). Each of these proteins identified at least two potentially interacting proteins in another species. The majority of proteins that were predicted to interact were identified from animal or fungal genomes. Four of the Rosetta stones identified in this analysis predict the formation of heterodimeric or multimeric proteins with similar domains. Presently, none of the candidate proteins listed in Table 2 have been shown to interact with each other.
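The categories counted above, and the condition that makes a multifunctional protein usable as a Rosetta stone, can be written out explicitly. The sketch below is illustrative and uses hypothetical function names. It assumes that each half's orthologs are stored as a mapping from species to ortholog identifier, and that a usable Rosetta stone requires the two halves to match two different proteins in the same species. The second assumption is an interpretation of the text, not a stated rule.

```python
def ortholog_pattern(first, second, whole):
    """Name the category for one protein from three yes/no ortholog results."""
    if first and second:
        return "both halves and whole" if whole else "both halves only"
    if first or second:
        return "one half and whole" if whole else "one half only"
    return "whole only" if whole else "no ortholog"


def rosetta_species(first_half_orthologs, second_half_orthologs):
    """Return {species: (ortholog_of_half1, ortholog_of_half2)} for species in
    which both halves have an ortholog and the two orthologs are different proteins."""
    pairs = {}
    for species in first_half_orthologs.keys() & second_half_orthologs.keys():
        a = first_half_orthologs[species]
        b = second_half_orthologs[species]
        if a != b:                       # assumed: same protein means no interacting pair
            pairs[species] = (a, b)
    return pairs
```

For example, a protein whose first half matches Q9H600 and whose second half matches Q5TBH6 in H. sapiens yields one predicted human pair; a species in which only one half has an ortholog contributes nothing.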

Three metabolic bifunctional proteins that had homologs in at least one other kingdom were also included in the analysis to determine if they shared a common origin with their homologs. The P. sojae protein Ps133120 was included in these tests because it catalyzes two steps in the lysine biosynthetic pathway. Fungal and animal orthologs to the aminotransferase domain were identified by RSD analysis, but only bacterial orthologs were identified with dihydrodipicolinate reductase activity. Ps135354 is a bifunctional protein with orotidine 5’-phosphate decarboxylase and orotate phosphoribosyltransferase activities, catalyzing two steps in the pyrimidine biosynthetic pathway. Homologues are present in Leishmania species, Trypanosoma species, and Parabodo caudatus [14]. Leishmania and Plasmodium orthologs were identified using the first domain, but the best orthologs to the second domain were from fungal genomes. Ps109321 has thymidylate synthase and dihydrofolate reductase activities, and bifunctional homologs are present in plants. RSD analysis identified orthologs for the first domain in Arabidopsis but not the second, and animal and fungal orthologs were identified in several species for both domains. Ps112102 is a trifunctional protein with adenylsulfate kinase, ATP sulfurylase, and inorganic pyrophosphatase activities. It catalyzes the first two enzymatic steps in sulfate assimilation, while also degrading pyrophosphate, which is a secondary product of the first enzymatic reaction. Homologues to this bifunctional protein are present in animals but not fungi. Fungal and animal orthologs were identified to the first domain, but no orthologs were associated with the ATP sulfurylase activity.

To determine the probable origin of functional domains within multifunctional proteins, consensus trees were generated with Phylip using homologous sequences from animal, fungal, plant, lower eukaryote, and selected bacterial genomes. Phylogenetic analysis suggests that several of these bifunctional proteins contain sequences of different phylogenetic origin. The first domain of Ps133120 clusters in a node with sequences from animal and fungal genomes and also a sequence from C. merolae. Analysis of the phylogenetic relationship of the second domain suggests that it is not closely related to other eukaryotic proteins. The orotidine 5’-phosphate decarboxylase domain of Ps135354 clusters independently from other eukaryotic genomes. The transferase domain clusters with the proteins from P. falciparum and a bacterial sequence. This analysis is consistent with the findings of Makiuchi et al. (2007), who concluded that the bifunctional proteins of kinetoplastids and stramenopiles represent separate fusion events [14].

Examination of both trees for Ps109321 does not support a monophyletic origin of the bifunctional proteins in P. sojae and plants. The first domain clusters independently from other sequences, and the second domain clusters away from plant proteins and is associated with a node containing both bacterial and opisthokont proteins. Phylogenetic analysis of the adenylsulfate kinase domain of Ps112102 revealed that it clustered in a node containing both plant and animal domains. The ATP sulfurylase domain clusters in a node that includes both C. merolae and plant proteins.

Phylogenetic analysis of the domains of the entire set of multifunctional proteins from P. sojae is presently underway.

DISCUSSION

Annotation of the first two sequenced Phytophthora genomes has revealed two novel features. The first is evidence of significant introgression of genes from the algal endosymbiont of the last universal common ancestor. The second feature, loss of introns and prevalence of novel multifunctional proteins, is the focus of this study. The set of multifunctional genes chosen for analysis in this study was defined using conservative criteria, because a robust set of candidate Rosetta stones was thought to be more useful. The total number of novel multifunctional proteins will increase with time due to refinements in genome assembly, new EST data, and refinements to gene prediction programs. However, given the low frequency of candidate protein interactors returned by this analysis, novel multifunctional proteins will likely be most useful as nodes defining novel regulatory pathways (see discussion below).

The most promising Rosetta stone is Ps131776, which has InterPro hits to domains with fatty acyl-CoA reductase and glycerol-3-phosphate acyltransferase activity. RSD analysis identified candidate interactors in several animal species. The human genes Q9H600 (MLSTD2) and Q5TBH6 (GNPAT) are both located in the peroxisome [15]. Predicted interactors for three Rosetta stones have been identified in Drosophila melanogaster. These genes will be included in future yeast two-hybrid assays conducted by the Finley lab at Wayne State University (http://proteome.wayne.edu/PIMdb.html).

Gene duplication events, whether selective expansion of gene families or whole genome duplication events, have been identified in animal, fungal, and plant lineages [16, 17, 18, 19]. Because such events are common to all three kingdoms and in some cases have been tied to the base of specific lineage expansions (vertebrates and ray-finned fishes), it is tempting to ascribe the diversity of these groups to these genetic events. In contrast, no whole genome duplication events have been identified in the oomycete genomes. However, phylogenetic evidence suggests that almost 5% of the genome originated from the photosynthetic endosymbiont of the ancestral genome. Our analysis of a subset of multifunctional metabolic proteins shows that the domains of these proteins are of mixed phylogenetic origin. The pathways with these novel multifunctional enzymes may have unique catalytic or regulatory efficiencies. The majority of multifunctional proteins have catalytic domains that suggest they function in regulatory pathways. Two lines of evidence suggest that such proteins may define novel signaling pathways. The first is our failure to identify orthologs for such proteins in other eukaryotes. The second is that our ongoing analysis of the phylogenetic origins of domains within multifunctional proteins suggests that the domains of such proteins are not monophyletic. This may point to a novel evolutionary strategy, distinct from that of the higher eukaryotes, that we have termed genome acquisition and recombination.

REFERENCES

1 J. A. Wrather, W. C. Stienstra, S. R. Koenning, Soybean disease loss estimates for the United States from 1996 to 1998. Can. J. Plant Pathol., 2001. 23: p. 122-131.

2 D. C. Erwin, O. K. Ribeiro, Phytophthora diseases worldwide. APS Press, St. Paul, Minn., 1996.

3 D. M. Rizzo, M. Garbelotto, E. M. Hansen, Annu. Rev. Phytopathol., 2005. 43: p. 309.

4 A. G. B. Simpson, A. J. Roger, The ‘real’ kingdoms of eukaryotes. Current Biology, 2004. 14(7): p. 693-696.

5 H. S. Yoon, J. D. Hackett, G. Pinto, D. Bhattacharya, The single, ancient origin of chromist plastids. Proc. Natl. Acad. Sci., 2002. 99: p. 15507-12.

6 B. M. Tyler et al., Phytophthora Genome Sequences Uncover Evolutionary Origins and Mechanisms of Pathogenesis. Science, 2006. 313: p. 1261-66.

7 R. F. Doolittle, Do you dig my groove? Nat Genetics, 1999. 23: p. 6-8.

8 E. M. Marcotte, M. Pellegrini, H. Ng, D. W. Rice, T. O. Yeates, D. Eisenberg, Detecting Protein Function and Protein-Protein Interactions from Genome Sequences. Science, 1999. 285: p. 751-753.

9 S. K. Kummerfeld, C. Vogel, M. Madera, M. Pacold, S. A. Teichmann, Evolution of Multi-domain Proteins by Gene Fusion and Fission. Trends Genet, 2004. 21: p. 25-30.

10 E. M. Zdobnov, R. Apweiler, InterProScan - an integration platform for the signature-recognition methods in InterPro. Bioinformatics, 2001. 17(9): p. 847-8.

11 S. F. Altschul, et al., Basic local alignment search tool. J Mol Biol, 1990. 215(3): p. 403-10.

12 D. P. Wall, H. B. Fraser, A. E. Hirsh, Detecting putative orthologs. Bioinformatics, 2003. 19(13): p. 1710-1.

13 J. Felsenstein, PHYLIP -- Phylogeny Inference Package (Version 3.2). Cladistics, 1989. 5: p. 164-166.

14 T. Makiuchi, T. Nara, T. Annoura, T. Hashimoto, T. Aoki, Occurrence of multiple, independent gene fusion events for the fifth and sixth enzymes of pyrimidine biosynthesis in different eukaryotic groups. Gene, 2007. 394(1-2): p. 78-86.

15 M. Diehn, G. Sherlock, G. Binkley, J. Heng, J. C. Matese, T. Hernandez-Boussard, C. A. Rees, J. M. Cherry, D. Botstein, P. O. Brown, A. A. Alizadeh, SOURCE: a unified genomic resource of functional annotations, ontologies, and gene expression data. Nucleic Acids Research, 2003. 31(1): p. 219-22.

16 T. Blomme, K. Vandepoele, S. De Bodt, C. Simillion, S. Maere, Y. Van de Peer, The gain and loss of genes during 600 million years of vertebrate evolution. Genome Biology, 2006. 7.

17 P. Dehal, J. L. Boore, Two rounds of whole genome duplication in animal genomes. PLoS Biology, 2005. 3: p. e314.

18 M. Kellis, B. W. Birren, E. S. Lander, Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature, 2004. 428: p. 617-624.

19 S. R. Wessler, J. C. Carrington, Genome studies and molecular genetics: The consequences of gene and genome duplication in plants. Current Opinion in Plant Biology, 2005. 8: p. 119-121.

APPENDIX

Table 1. Summary of P. sojae multifunctional proteins

Orthologs Orthologs Orthologs Name to First to Second Half 1 Function Half 2 Function to Whole Half Half 108305 No Yes Yes Glyceraldehyde 3-phosphate dehydrogenase Glyceraldehyde 3-phosphate dehydrogenase 108332 No No No 1-aminocyclopropane-1-carboxylate synthase Aminotransferases class-I pyridoxal-phosphate-binding site 109321 Yes Yes No Dihydrofolate reductase Thymidylate synthase 110986 Yes No Yes Papain cysteine protease (C1) Papain cysteine protease (C1) 112102 No No No Adenylylsulfate kinase Inorganic pyrophosphatase 116413 Yes No Yes Cellulose-binding domain, fungal type Apple domain 116824 No Yes No Phox-like Protein kinase 121267 Yes Yes Yes Calponin-like actin-binding Calponin-like actin-binding 127111 No Yes No Protein kinase Protein kinase 127112 No No No Protein kinase Protein kinase 127327 No Yes Yes Spc97_Spc98 Spc97_Spc98 127360 No Yes Yes GAF domain 3'5'-cyclic nucleotide phosphodiesterase 127507 No No No Myosin head WW/Rsp5/WWP 127510 Yes No Yes IQ calmodulin-binding region WW/Rsp5/WWP 127556 No No No Ankyrin Protein kinase 127616 No No No Ankyrin Protein kinase 128000 No No No Zn-finger, TRAF type Zn-finger, TRAF type 128108 No No No Calcium-binding EF-hand Peptidylprolyl isomerase, FKBP-type 128154 Yes No Yes Hydroxymethylglutaryl-coenzyme A synthase Beta-ketoacyl synthase 128164 No No No Zn-finger, FYVE type GAF domain 128404 Yes No Yes Protein of unknown function UPF0083 Calcium-binding EF-hand 128465 Yes No Yes dysferlin 128467 No Yes No Tudor domain Cytochrome c heme-binding site 128512 No No No Ion transport protein cyclic nucleotide-binding domain 128561 No No No Ion transport protein cyclic nucleotide-binding domain 128566 No No No Ion transport protein cyclic nucleotide-binding domain 128682 No No No Ankyrin ankyrin 128684 No No No RNA-binding region RNP-1 Calcium-binding EF-hand 15

128770 No No No Zn-finger, FYVE type GAF domain 128810 No Yes No pleckstrin like pleckstrin like 128860 Yes Yes Yes Zn-finger, FYVE type Protein kinase 128889 No No No Zn-finger, FYVE type TPR repeat 128938 No Yes No Zn-finger, FYVE type TPR repeat 129030 No No No Prephenate dehydratase Aminotransferase, class I and II 129035 Yes Yes No Prephenate dehydratase Aminotransferase, class I and II 129073 No Yes Yes Cation channel, non-ligand gated 3'5'-cyclic nucleotide phosphodiesterase 129196 Yes No No Calcium-binding EF-hand fibronectin, type III 129281 No No No Unknown RNA-binding region RNP-1 (RNA recognition motif) 129310 Yes Yes Yes Zn-finger, FYVE type GAF domain 129468 Yes No No C2 domain Calcium-binding EF-hand 129478 Yes No Yes TPR repeat Calcium-binding EF-hand 129492 No No No Zn-finger, RING Tubby 129511 No No No Ankyrin Protein kinase 129604 No No No Unknown Ion transport protein, K+ channel, pore region 129610 Yes Yes Yes Adenylylsulfate kinase ATP-sulfurylase 129627 Yes No Yes Protein kinase pleckstrin like 129631 No Yes No Unknown Unknown 129674 Yes Yes Yes Regulator of G protein Regulator of G protein 129770 Yes Yes Yes IQ calmodulin-binding region ATP/GTP-binding site motif A 129786 Yes Yes Yes Zn-finger, FYVE type Steroidogenic acute regulatory protein 129826 No No No PDZ/DHR/GLGF domain HEAT repeat 129859 No Yes Yes Carbohydrate-binding module, family 25 Alpha amylase, catalytic domain 129928 Yes Yes No Unknown Actin-binding FH2 129945 Yes No Yes Myosin head (motor domain) 10 calmodulin-binding region, ankyrin 129974 No No No DNA/pantothenate metabolism flavoprotein Lipase, class 3 130091 No No No WW/Rsp5/WWP ankyrin 130115 Yes Yes No Ion transport protein Ryanodine receptor 130134 Yes Yes Yes RabGAP/TBC domain PDZ/DHR/GLGF domain 130174 No No No pleckstrin-like phosphatidylinositol-4-phosphate 5 kinase 130241 No No No pleckstrin-like Protein kinase 16

130470 No No No Unknown Heavy metal transport/detoxification protein 130500 No No No Ankyrin Cyclin-like F-box 130660 No No Yes Unknown Unknown 130733 No No No 4Fe-4S ferredoxin, iron-sulfur binding domain Calcium-binding EF-hand 130785 No No No basic helix-loop-helix dimerization domain bHLH Beta and gamma crystallin 131081 No No No Unknown Zn-finger, modified RING 131275 No No No Ion transport prote Ion transport prote 131295 Yes No Yes Protein kinase Protein phosphatase 2C-like 131310 Yes No Yes Unknown ankyrin 131511 No Yes No Unknown 3'5'-cyclic nucleotide phosphodiesterase 131551 No Yes Yes pleckstrin like Protein kinase 131558 No No No Mannitol dehydrogenase unknown 131574 No No No Pleckstrin/ G-protein, interacting region Protein kinase 131578 No No No Lipid-binding START Zn-finger, FYVE type 131604 No No No Lipid-binding START Zn-finger, FYVE type 131776 Yes Yes Yes male sterility protein Phospholipid/glycerol acyltransferase 131820 No Yes Yes SH3 domain zinc-binding, LIM 131870 No No No Calcium-binding EF-hand Calcium-binding EF-hand 132073 No No No Ctr copper transporter Heavy metal transport/detoxification protein 132089 No No No regulator of chromosome condensation BTB/POZ domain 132153 No Yes No pleckstrin like Phosphatidylinositol 3- and 4-kinase 132156 Yes No No Pectate lyase Pectate lyase 132157 Yes No No Pectate lyase Pectate lyase 132235 No No No G protein beta WD-40 repeat Calcium-binding EF-hand 132410 No Yes No G-protein-coupled receptor family 2 phosphatidylinositol-4-phosphate 5 kinase like protein 132433 No No No cyclic nucleotide-binding domain Ion transport protein 132531 No No No Leucine-rich repeat ribonuclease inhibitor subtype Calcium-binding EF-hand Tyrosine specific protein phosphatase and dual specificity 132560 No No Yes Zn-finger, FYVE type protein phosphatase 132571 No No No Ubiquitin thiolesterase, family 2 Ubiquitin thiolesterase, family 2 132772 No No No cyclic nucleotide-binding domain Mitochondrial carrier protein 132790 No 
Yes Yes cyclic nucleotide-binding domain Regulator of G protein 132793 No Yes No BED finger Demethylmenaquinone methyltransferase 17

132804 | No | No | No | Ankyrin | Phospholipase A2
132853 | No | Yes | No | 6,7-dimethyl-8-ribityllumazine synthase | Calcium-binding EF-hand
132947 | No | No | No | PDZ/DHR/GLGF domain | unknown
132963 | No | No | No | Zn-finger, RING | Phox-like
133120 | Yes | Yes | Yes | Aminotransferase, class I and II | Dihydrodipicolinate reductase
133201 | No | No | No | Adenylate kinase | Protein kinase
133512 | No | Yes | Yes | PDZ/DHR/GLGF domain | TLDc
133527 | Yes | No | No | Protein kinase | Calcium-binding EF-hand
133541 | No | No | No | Protein kinase | Rho GAP domain
133601 | Yes | Yes | Yes | Leucine-rich repeat ribonuclease inhibitor subtype | Leucine-rich repeat ribonuclease inhibitor subtype
133628 | Yes | Yes | No | glutathione S-transferase | glutathione S-transferase
133630 | Yes | No | No | glutathione S-transferase | glutathione S-transferase
133865 | No | No | No | Protein kinase | Protein kinase
133867 | No | Yes | Yes | forkhead-associated | Protein phosphatase 2C subfamily
133945 | No | No | No | Zn-finger, FYVE type | Ubiquitin-specific protease DUSP
134002 | No | No | No | Integrins alpha chain | Zn-finger, FYVE type
134003 | No | No | No | Integrins alpha chain | Zn-finger, FYVE type
134136 | No | Yes | Yes | pleckstrin like | Phosphatidylinositol 3- and 4-kinase
134235 | No | Yes | Yes | EGF-like domain | Glycosyl transferase, family 48
134325 | No | Yes | Yes | Zn-finger, FYVE type | Actin-binding FH2
134329 | No | No | Yes | General substrate transporter | Glycosyl transferase, family 48
134360 | No | No | No | GCN5-related N-acetyltransferase | unknown
134409 | No | Yes | No | Zn-finger, FYVE type | GAF domain
134417 | No | No | No | G protein beta WD-40 repeat | Calcium-binding EF-hand
134555 | Yes | No | Yes | PWWP domain | Mov3 family
134569 | No | No | No | Calcium-binding EF-hand | G protein beta WD-40 repeat
134574 | Yes | Yes | Yes | Calcium-binding EF-hand | Calcium-binding EF-hand
134593 | No | No | No | Zn-finger, FYVE type | ATP-dependent DNA ligase
134607 | No | No | No | Zn-finger, FYVE type | RabGAP/TBC domain
134637 | No | No | No | Calcium-binding EF-hand | Mg2+ transporter protein, CorA-like
134643 | No | Yes | No | Calcium-binding EF-hand | Mg2+ transporter protein, CorA-like
134705 | No | No | No | , Ran binding | Protein of unknown function DUF609
134761 | No | No | No | Unknown | unknown

134863 | No | Yes | Yes | C2 domain | RabGAP/TBC domain
135053 | No | Yes | Yes | G protein beta WD-40 repeat | G protein beta WD-40 repeat
135354 | Yes | Yes | Yes | Orotidine 5'-phosphate decarboxylase | Phosphoribosyltransferase
135428 | No | No | No | fibronectin, type III | Kelch repeat
135611 | No | No | No | G protein beta WD-40 repeat | ATP/GTP-binding site motif A
135714 | Yes | Yes | Yes | Phox-like | Ubiquitin thiolesterase, family 2
135760 | Yes | Yes | Yes | HAD-superfamily hydrolase, subfamily IB | ATP/GTP-binding site motif A
135814 | No | No | No | Phox-like | Phox-like
135844 | No | Yes | No | Unknown | Unknown
135918 | No | Yes | No | Phox-like | Zn-finger, RING
135991 | No | Yes | No | Zinc finger, C2H2-type | NUDIX hydrolase
136010 | No | No | No | pleckstrin like | Protein kinase
136117 | No | Yes | No | Phox-like | regulator of chromosome condensation, RCC1
136186 | No | Yes | Yes | TPR repeat | Cadherin cytoplasmic region
136407 | Yes | No | No | Cyclin-like F-box | NHL repeat
136466 | No | No | No | E-class P450, group I | E-class P450, group I
136471 | No | No | No | IQ calmodulin-binding region | ankyrin
136487 | No | Yes | Yes | Phox-like | Protein kinase
136563 | Yes | No | No | RabGAP/TBC domain | ankyrin
136608 | No | No | Yes | Calcium-binding EF-hand | Serine/threonine-specific protein phosphatase and bis(5-nucleosyl)-tetraphosphatase
136693 | No | Yes | No | pleckstrin like | Lipid-binding START
136761 | No | No | No | Zn-finger, FYVE type | Phosphatidylinositol 3- and 4-kinase
136867 | No | No | No | similar to Regulator of G protein signalling | similar to Regulator of G protein signalling
136993 | No | Yes | Yes | cyclic nucleotide-binding domain | Glycerophosphoryl diester phosphodiesterase
137195 | No | Yes | Yes | Lipase, class 3 | Zn-finger, FYVE type
137226 | No | Yes | Yes | Calpain-type cysteine protease, C2 family | Unknown
137309 | Yes | Yes | Yes | IQ calmodulin-binding region | WW/Rsp5/WWP
137315 | No | No | No | von Willebrand factor, type A | Zn-finger, RING
137318 | No | No | No | DNA repair protein | DEAD/DEAH box helicase
137342 | No | No | No | von Willebrand factor, type A | ATP/GTP-binding site motif A
137357 | No | No | No | Cyclin-like F-box | Zinc finger, C2H2-type
137454 | Yes | No | Yes | RabGAP/TBC domain | fibronectin, type III

137502 | Yes | No | No | Zn-finger, FYVE type | Zn-finger, FYVE type
137719 | No | Yes | Yes | EGF-like domain | Zinc transporter ZIP
137758 | No | No | No | Glycosyl transferase, family 28 | Glycosyl transferase, family 28
137759 | No | Yes | No | Zn-finger-like, PHD finger | Actin-binding FH2
137793 | Yes | Yes | No | OTU-like cysteine protease | proline-rich region
137843 | Yes | No | No | Myosin head (motor domain) | IQ calmodulin-binding region
137935 | No | No | No | Protein kinase | RNA-directed DNA polymerase
137960 | No | No | No | Calcium-binding EF-hand | Mitochondrial substrate carrier
138031 | No | No | No | Fatty acid desaturase, type 2 | Zn-finger, FYVE type
138087 | No | No | No | Protein kinase | Protein kinase
138204 | No | Yes | Yes | zinc finger ZZ-type | ubiquitin-associated domain
138371 | Yes | No | Yes | Cold-shock DNA-binding domain | Cold-shock DNA-binding domain
138397 | Yes | Yes | Yes | Cyclin, N-terminal domain | Protein kinase
138430 | No | Yes | No | Calcium-binding EF-hand | C2 domain
138460 | No | No | No | Zn-finger, FYVE type | Legume lectin, beta domain
138514 | No | Yes | No | Actin-binding, actinin-type | Actin-binding, actinin-type
138641 | No | Yes | Yes | Zn-finger, FYVE type | Cellular retinaldehyde-binding/triple function, C-terminal
138708 | No | No | No | pleckstrin like | PDZ/DHR/GLGF domain
138725 | No | No | No | Unknown | Unknown
138742 | No | No | No | Centromere protein B, DNA-binding domain | Short-chain dehydrogenase/reductase SDR
138788 | No | No | Yes | WW/Rsp5/WWP | Cyclin
138836 | No | No | No | pleckstrin like | Protein kinase
138878 | Yes | No | No | TPR repeat | TPR repeat, Zn-finger, MYND type
138929 | No | No | No | GAF domain | Zn-finger, FYVE type
138980 | No | No | No | Histone H4 | Unknown
139041 | No | Yes | No | IQ calmodulin-binding region | Inositol polyphosphate related phosphatase family
139198 | No | No | Yes | ATP/GTP-binding site motif A (P-loop) | General substrate transporter, Glycosyl transferase, family 48, major facilitator superfamily
139331 | Yes | No | Yes | C2 domain | C2 domain
139448 | No | No | Yes | Unknown | Armadillo repeat
139531 | Yes | Yes | Yes | ATP/GTP-binding site motif A (P-loop) | Vitamin K-dependent carboxylation/gamma-carboxyglutamic region
139621 | No | Yes | Yes | G-protein-coupled receptor family 2 | Phosphatidylinositol-4-phosphate 5-kinase

139641 | No | No | Yes | Ankyrin | Protein kinase
139763 | No | No | No | Zn-finger | GAF domain
139983 | No | No | No | pleckstrin like | Lipid-binding START
139991 | No | No | No | forkhead-associated | ATP/GTP-binding site motif A (P-loop)
140065 | No | No | No | Unknown | Zn-finger, RING
140109 | No | No | No | Unknown | Zn-finger, FYVE type
140112 | No | Yes | Yes | Protein kinase | Armadillo repeat
140145 | No | Yes | Yes | Unknown | Tubulin-tyrosine ligase
140152 | No | Yes | No | Calponin-like actin-binding | Unknown
140223 | No | No | No | Ubiquitin domain | Zn-finger, RING
140234 | No | No | No | Ubiquitin domain | Zn-finger, RING
140325 | No | No | No | Leucine-rich repeat ribonuclease inhibitor subtype | Short-chain dehydrogenase/reductase SDR
140465 | No | Yes | Yes | Unknown | Calcium-binding EF-hand
140602 | No | Yes | No | Unknown | Unknown
140611 | No | No | No | Ion transport protein | cyclic nucleotide-binding domain
140672 | Yes | No | No | Calcium-binding EF-hand, leucine-rich repeat | Unknown
140735 | No | Yes | Yes | Phospholipid-translocating P-type ATPase, flippase | Guanylate cyclase
140843 | Yes | No | Yes | Protein kinase | Protein phosphatase 2C like
140859 | No | No | No | C2 domain | Protein kinase
140880 | Yes | Yes | Yes | myosin head (motor domain) | CBS domain
140978 | No | Yes | Yes | Ryanodine receptor | Cation channel, non-ligand gated, Ion transport protein, Cation (not K+) channel, TM region
140988 | Yes | No | Yes | Unknown | Protein kinase
140999 | No | Yes | Yes | Unknown | Mg2+ transporter protein, CorA-like
141052 | Yes | No | No | ATP/GTP-binding site motif A (P-loop) | Oxidoreductase, molybdopterin binding
141080 | No | Yes | Yes | Peptidylprolyl isomerase, FKBP-type | Peptidyl-prolyl cis-trans isomerase, cyclophilin type
141193 | No | No | No | General substrate transporter | kinesin motor region
141211 | No | No | No | Unknown | Zn-finger, FYVE type, GAF domain
141796 | Yes | Yes | Yes | pleckstrin like | Lipid-binding START
141833 | No | No | No | Armadillo repeat | Armadillo repeat
141859 | No | No | No | pleckstrin like | Zn-finger, FYVE type
141887 | Yes | No | No | myosin head (motor domain) | ankyrin
141900 | No | No | Yes | Unknown | Glycosyl transferase, family 48, General substrate transporter
141988 | No | No | No | Rhodopsin-like GPCR superfamily, G-protein coupled receptors family 2 | Phosphatidylinositol-4-phosphate 5-kinase
141994 | No | No | No | WW/Rsp5/WWP | ankyrin

142173 | No | No | No | IQ calmodulin-binding region | ankyrin
142204 | No | Yes | No | Phox-like | Protein kinase
142274 | Yes | No | Yes | Leucine-rich repeat ribonuclease inhibitor subtype | Calcium-binding EF-hand
142328 | No | No | Yes | Myosin head | Phox-like
142346 | No | No | No | Glycosyl transferase, family 28 | Glycosyl transferase, family 28
142375 | No | No | No | Zn-finger, FYVE type | unknown
142434 | No | No | No | Response regulator receiver | sterile alpha motif SAM
142501 | No | No | No | Pectate lyase | Pectate lyase
142577 | Yes | No | No | Myosin head | PDZ/DHR/GLGF domain
142606 | No | No | No | Ion transport protein | cyclic nucleotide-binding domain
142615 | No | No | No | Ion transport protein | cyclic nucleotide-binding domain
142672 | No | No | No | ricin B lectin domain | ricin B lectin domain
142688 | Yes | No | Yes | D-3-phosphoglycerate dehydrogenase | Phosphoserine aminotransferase, Methanosarcina type
142862 | No | No | No | Calcium-binding EF-hand | Calcium-binding EF-hand
142970 | No | No | No | Lipid-binding START | Catalytic domain of components of various dehydrogenase complexes
143054 | No | No | No | Glycosyl transferase, family 48 | General substrate transporter
143092 | No | No | No | ATP/GTP-binding site motif A | Protein kinase
143101 | No | No | No | Calcium-binding EF-hand | Recoverin
143228 | No | No | No | pleckstrin like | Protein kinase
143244 | Yes | No | Yes | pleckstrin like | Protein kinase
143397 | No | No | No | Zn-finger-like, PHD finger | Bipartite nuclear localization signal
143580 | Yes | Yes | Yes | Transcriptional coactivator p15 (PC4) | Transcriptional coactivator p15 (PC4)
143645 | No | No | No | Myosin head | Zn-finger, FYVE type
143651 | No | Yes | Yes | cyclic nucleotide-binding domain | Protein phosphatase 2C-like
143782 | Yes | No | No | TPR repeat | heat shock protein hsp70
143940 | Yes | No | No | pleckstrin like | Protein kinase
143947 | No | No | No | Zn-finger, FYVE type | unknown
143948 | No | No | No | Zn-finger, FYVE type | unknown

143949 | Yes | No | Yes | Ankyrin | Protein kinase
143950 | No | No | No | Protease-associated PA | ankyrin
143965 | Yes | Yes | Yes | Molluscan rhodopsin C-terminal tail | Calcium-binding EF-hand
144250 | No | Yes | No | Apple domain | Apple domain
144366 | No | No | No | Protein kinase | Protein kinase
144802 | No | Yes | No | pleckstrin like | Protein kinase
144807 | No | No | No | pleckstrin like | Protein kinase
144830 | Yes | Yes | No | Thioesterase | AMP-dependent synthetase and ligase
144890 | No | No | No | Centromere Protein B, helix-turn-helix | Short-chain dehydrogenase/reductase SDR
144923 | No | No | No | Unknown | Pectate lyase
145029 | No | Yes | No | Prephenate dehydratase | Aminotransferase, class I and II
155208 | No | No | No | SAM (and some other nucleotide) binding motif | aspartyl/asparaginyl beta-hydroxylase
155257 | No | No | No | Integrins alpha chain | Na-Ca exchanger/integrin-beta4
155337 | No | Yes | No | Myosin head | pleckstrin like
156594 | No | Yes | No | cyclic nucleotide-binding domain | Regulator of G protein
156898 | No | No | No | Ankyrin | Guanine-nucleotide dissociation stimulator CDC25
156906 | No | No | No | Calponin-like actin-binding | Myosin head
156997 | Yes | No | No | Phosphoadenosine phosphosulfate reductase CysH-type | Glutaredoxin
157108 | No | Yes | No | Protein kinase | Phosphotyrosyl phosphatase activator, PTPA
157576 | No | No | No | ATP/GTP-binding site motif A | von Willebrand factor, type A
157625 | No | No | No | Unknown | unknown
157928 | No | No | No | cyclic nucleotide-binding domain | forkhead-associated
158721 | No | No | No | cyclic nucleotide-binding domain | cyclic nucleotide-binding domain
158849 | No | Yes | No | C2 domain | Calcium-binding EF-hand

Table 2: Rosetta stone candidates

P. sojae ID | Domain A | Domain B | Activity A | Activity B

143965 | A. fumigatus Q4WPR6 | A. fumigatus Q4WDF2 | Calcium dependent membrane targeting | Calcium binding

141796 | A. gambiae Q7Q4U8 | A. gambiae Q5TVR3 | Pleckstrin homology | No predicted activity

137793 | B. rerio Q6NW80 | B. rerio Q6PCS2 | Cysteine protease | Proline-rich glycoprotein

135760 | H. sapiens Q8TCD6 | H. sapiens Q8TCT1 | Pyridoxal phosphate phosphatase | Phosphoethanolamine
       | T. nigroviridis Q4SF86 | T. nigroviridis Q4SET6 | |

134574 | B. rerio Q8UUX9 | B. rerio Q7SXX4 | Ca binding EF hand | Ca binding EF hand
       | C. briggsae Q5WNC1 | C. briggsae Q60ME4 | |
       | C. neoformans Q5KE11 | C. neoformans Q9HDE1 | |
       | D. melanogaster P37236 | D. melanogaster P48451 | |
       | G. gallus P79880 | G. gallus P62167 | |
       | R. norvegicus Q8R426 | R. norvegicus P63100 | |
       | T. nigroviridis Q4SJ38 | T. nigroviridis Q4RNK4 | |

131776 | C. briggsae Q622T8 | C. briggsae Q60Y13 | Peroxisomal fatty acyl CoA reductase | Peroxisomal dihydroxyacetone phosphate O-acyltransferase
       | D. melanogaster Q9V7S1 | D. melanogaster Q8IMM7 | |
       | H. sapiens Q9H600 | H. sapiens Q5TBH6 | |
       | M. musculus Q9D0Q1 | M. musculus P98192 | |
       | R. norvegicus Q66H50 | R. norvegicus Q9ES71 | |
       | T. nigroviridis Q4TCI8 | T. nigroviridis Q4RF56 | |

129674 | T. nigroviridis Q4SSQ1 | T. nigroviridis Q4RZ90 | Regulator of G protein | Regulator of G protein

129786 | A. gambiae Q7QI56 | A. gambiae Q7QIT3 | Zn finger FYVE-type | Steroidogenic regulatory protein
       | B. rerio Q7T3F6 | B. rerio Q9DFS4 | |
       | D. melanogaster Q8IMP2 | D. melanogaster Q9W145 | |

129610 | K. lactis Q6CVB5 | K. lactis Q6CNU6 | Sulfate assimilation adenyl sulfate kinase | Sulfate assimilation ATP sulfurylase
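
The criterion behind Table 2, that both domains of a P. sojae protein have a putative ortholog in the same organism, can be illustrated with a minimal Python sketch. The function name and the dictionary input format are assumptions made for illustration and are not part of the analysis pipeline used in this work. The accessions are the 129786 entries of Table 2, plus one hypothetical organism that is included only to show a non-match.

```python
def rosetta_candidates(orthologs_a, orthologs_b):
    """Return (organism, accession A, accession B) for every organism
    that has a putative ortholog for both domains.

    Each argument maps an organism name to the accession of the putative
    ortholog of one domain (hypothetical input format).
    """
    shared = sorted(set(orthologs_a) & set(orthologs_b))
    return [(org, orthologs_a[org], orthologs_b[org]) for org in shared]


# Domain A and Domain B orthologs of P. sojae protein 129786 (Table 2).
domain_a = {
    "A. gambiae": "Q7QI56",
    "B. rerio": "Q7T3F6",
    "D. melanogaster": "Q8IMP2",
    # Hypothetical entry, not from the thesis: an ortholog for domain A only.
    "X. hypothetical": "A00000",
}
domain_b = {
    "A. gambiae": "Q7QIT3",
    "B. rerio": "Q9DFS4",
    "D. melanogaster": "Q9W145",
}

for organism, acc_a, acc_b in rosetta_candidates(domain_a, domain_b):
    print(organism, acc_a, acc_b)
```

An organism with an ortholog for only one of the two domains, such as the hypothetical entry above, is excluded, which is why the number of candidates can be far smaller than the number of multifunctional proteins.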

Figure 1. Unrooted tree made by the Protein Sequence Parsimony method of P. sojae protein Ps109321 and homologous proteins from animal, fungal, plant, lower eukaryote, and selected bacterial genomes. Numbers on the branches display the level of bootstrap support (out of 1000 replicates). Predicted domain: Dihydrofolate reductase.

Figure 2. Unrooted tree made by the Protein Sequence Parsimony method of P. sojae protein Ps109321 and homologous proteins from animal, fungal, plant, lower eukaryote, and selected bacterial genomes. Numbers on the branches display the level of bootstrap support (out of 1000 replicates). Predicted domain: Thymidylate synthase.

Figure 3. Unrooted tree made by the Protein Sequence Parsimony method of P. sojae protein Ps112102 and homologous proteins from animal, fungal, plant, lower eukaryote, and selected bacterial genomes. Numbers on the branches display the level of bootstrap support (out of 1000 replicates). Predicted domain: Adenylylsulfate kinase.

Figure 4. Unrooted tree made by the Protein Sequence Parsimony method of P. sojae protein Ps112102 and homologous proteins from animal, fungal, plant, lower eukaryote, and selected bacterial genomes. Numbers on the branches display the level of bootstrap support (out of 1000 replicates). Predicted domain: Inorganic pyrophosphatase.

Figure 5. Unrooted tree made by the Protein Sequence Parsimony method of P. sojae protein Ps133120 and homologous proteins from animal, fungal, plant, lower eukaryote, and selected bacterial genomes. Numbers on the branches display the level of bootstrap support (out of 1000 replicates). Predicted domain: Aminotransferase, class I and II.

Figure 6. Unrooted tree made by the Protein Sequence Parsimony method of P. sojae protein Ps133120 and homologous proteins from animal, fungal, plant, lower eukaryote, and selected bacterial genomes. Numbers on the branches display the level of bootstrap support (out of 1000 replicates). Predicted domain: Dihydrodipicolinate reductase.

Figure 7. Unrooted tree made by the Protein Sequence Parsimony method of P. sojae protein Ps135354 and homologous proteins from animal, fungal, plant, lower eukaryote, and selected bacterial genomes. Numbers on the branches display the level of bootstrap support (out of 1000 replicates). Predicted domain: Orotidine 5'-phosphate decarboxylase.

Figure 8. Unrooted tree made by the Protein Sequence Parsimony method of P. sojae protein Ps135354 and homologous proteins from animal, fungal, plant, lower eukaryote, and selected bacterial genomes. Numbers on the branches display the level of bootstrap support (out of 1000 replicates). Predicted domain: Phosphoribosyltransferase.
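
The bootstrap values shown on the branches of these trees report how often a grouping is recovered when the columns of the alignment are resampled with replacement. The following Python sketch illustrates only the resampling and counting steps. It does not perform the parsimony tree search, and the function names, the toy alignment, and the clade sets are hypothetical.

```python
import random


def bootstrap_columns(alignment, rng):
    """Build one bootstrap replicate by drawing alignment columns
    with replacement; every sequence keeps its original length."""
    n_cols = len(alignment[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[i] for i in picks) for seq in alignment]


def branch_support(clade, replicate_clades):
    """Count the replicate trees that contain `clade`.

    `clade` is a frozenset of taxon names; `replicate_clades` holds one
    set of clades per replicate tree (produced by a tree-building step
    that is outside the scope of this sketch).
    """
    return sum(1 for clades in replicate_clades if clade in clades)


# Toy aligned sequences (hypothetical, all the same length).
alignment = ["ACDE", "ACDF", "GCDE"]
replicate = bootstrap_columns(alignment, random.Random(0))

# Hypothetical clade sets from three replicate trees.
replicate_clades = [
    {frozenset({"a", "b"})},
    {frozenset({"a", "c"})},
    {frozenset({"a", "b"}), frozenset({"c", "d"})},
]
support = branch_support(frozenset({"a", "b"}), replicate_clades)
print(support, "of", len(replicate_clades), "replicates")
```

With 1000 replicates, as in Figures 1 to 8, the same count read directly off the branch gives the support value.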