Supplemental data Heidel et al.

Table of Contents
1. Sequencing strategy and statistics
2. Genome structure
2.1 Extrachromosomal elements
2.2 Chromosome structure
2.3 Repetitive elements
3. Coding sequences
3.1 Homopolymer tracts
3.2 Gene families and orthology relationships
3.3 Synteny analysis
4. Protein functional domains
5. Protein families
5.1 Primary metabolism
5.2 Secondary metabolism
5.3 Cell shape, organization and motility
5.3.1 Kinesins
5.3.2
5.3.3 The microfilament system
5.4 Gene transcription
5.5 Cell adhesion
5.6 Cell signaling
5.6.1 Seven transmembrane domain receptors including G-protein coupled receptors
5.6.2 Cyclic nucleotide synthesis, detection and degradation
5.6.3 Sensor histidine kinases
5.6.4 Monomeric G-
5.6.5 ABC transporters
6. Supplemental Methods
6.1 DNA isolation
6.2 Sequencing and Assembly
6.3 Chromosome structure analysis
6.4 Gene prediction, Blast and synteny analysis
6.5 Gene family detection using domain analysis
6.6 Protein variation
6.7 Phylogeny and species split dating
7. References

1. Sequencing strategy and statistics

The sequencing projects for both the P. pallidum (PP) and D. fasciculatum (DF) genomes were initiated by paired-end Sanger sequencing of 1-2 kb sheared genomic DNA inserts cloned in pUC18, soon to be followed by four runs of more cost-effective pyrosequencing using the 454 Roche platform, yielding a total coverage of over 15 x for each genome (Table S1.1). A dense physical map of both genomes was prepared by paired-end sequencing of ~30 kb genomic DNA fragments inserted into the fosmid vector pCC2FOS. The gaps within supercontigs were filled in by primer walking, until the sequence at both sides of the gap became too repetitive for primer design. At this point only 52 and 33 gaps remained in the PP and DF genomes respectively, which compares very favorably against the current state of 226 gaps in the DD genome, 5 years after completion and continued polishing (Table 1, main text).

Table S1.1: Sequencing statistics

                                        P. pallidum   D. fasciculatum
reads from fosmid clones                7937          5033
reads from small insert library clones  112268        80433
454 runs/MB raw data                    4/416         4/418
Contigs from initial newbler assembly   6352          7292
gap closing reads                       1068          1165
genome coverage                         14.4 x        14.9 x
Final assembly: contigs/supercontigs    52/41         33/25

2. Genome structure

2.1 Extrachromosomal elements

Metazoans and plants integrate tandem arrays of rRNA genes into existing chromosomes (Schwarzacher and Wachtler 1993) to provide a large number of copies for simultaneous transcription. In contrast, some protists, such as Tetrahymena (Blomberg et al. 1997), amplify their rRNA genes on extrachromosomal palindromes. All analysed Dictyostelia contain an amplified extrachromosomal sequence that codes for rRNA genes. Because of their high abundance, sequences of extrachromosomal elements are highly overrepresented in the sequencing reads. Yet their repetitiveness prevented assembly from 454 read data alone, so small insert library reads were used to assemble the full-length palindrome arms. rRNA genes were defined based on homology to the DD counterparts and the common eukaryote set of rRNA genes. Based on the relative abundance of the sequencing reads that match the palindromes, compared to unique parts of the genome, the PP, DF and DD palindromes make up around 5, 9 and 25% of the genomic DNA content, respectively. This difference is mainly attributable to the shorter palindrome arms in PP (15 kb) and DF (26 kb) compared to 45 kb for DD. Thus, the number of palindrome molecules is in the same range in all species. The palindrome organization is the same as in DD (Sucgang et al. 2003): the rRNA genes reside at the ends of the arms with transcription directed towards the telomeric regions. The observed plasticity in palindrome size is thus due to different amounts of repeated sequences in their central regions. To highlight conserved regions in the extrachromosomal palindromic elements of DD, PP and DF, the full (DD) or half palindromes (PP and DF) were aligned and represented pairwise in a dot-matrix using Tuple_plot (Szafranski et al. 2006), an algorithm which reduces noise caused by repetitive sequence. Only the regions that contain the ribosomal RNA genes are conserved between the three palindromes, and these regions are typically located at the end of the palindrome arms (Fig. S2.1).
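The step from genomic share to copy number can be made explicit with a short calculation. The sketch below divides each palindrome's share of genomic DNA by its full length (two arms). It assumes, which is not stated above, that the three genomes are of similar size, so the result is only a relative measure; the function name is illustrative.

```python
# Share of genomic DNA (percent) and palindrome arm length (kb), as given in the text.
palindromes = {
    "PP": {"share_pct": 5, "arm_kb": 15},
    "DF": {"share_pct": 9, "arm_kb": 26},
    "DD": {"share_pct": 25, "arm_kb": 45},
}


def relative_copy_number(share_pct, arm_kb):
    """Genomic share per kb of full palindrome (two arms).

    Proportional to the number of palindrome molecules only under the
    assumption that the genomes being compared are of similar size.
    """
    return share_pct / (2 * arm_kb)


relative = {sp: relative_copy_number(**v) for sp, v in palindromes.items()}
baseline = relative["PP"]
for sp, value in relative.items():
    print(f"{sp}: {value / baseline:.2f} x the PP value")
```

Under these assumptions the three relative values differ by less than a factor of two, whereas the genomic shares themselves differ five-fold.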


A Comparison between the extrachromosomal elements of DD (X-axis) and DF (Y-axis).

B Comparison between DD (X-axis) and PP (Y-axis).

C Comparison between DF (X-axis) and PP (Y-axis).

Figure S2.1: Pairwise alignments between DD, PP and DF extrachromosomal elements represented as Tuple_plots. The full palindrome is shown only for DD. Yellow shadowing denotes the positions of the rRNA genes. Black dots represent tuples in the same direction, red dots inverse tuples. Representations are drawn to scale so that the figures are comparable.

2.2 Chromosome structure

DD lacks the telomere structures common to eukaryotes. However, when we searched the assembled genomes of PP and DF for the common TTTAGA motif and variations thereof, we found repeated structures of such motifs in both species (14 in PP and 12 in DF). The motifs were all located at contig ends, indicating a role as telomeres. In DF, half of these structures were directly associated with DIRS transposable elements, reflecting the location of DIRS elements at telomere ends in DD. The repeated telomere motifs in PP were found in the neighborhood of complex structures (see main text), but no DIRS elements could be found in the entire PP genome. Figure S2.2 shows the chromosomes of PP separated by Pulsed Field Gel Electrophoresis (PFGE), confirming the prediction of 7 chromosomes from 14 telomeres.
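A search of this kind can be sketched as follows. This is a minimal illustration, not the procedure actually used: it looks only for exact tandem copies of the motif (or its reverse complement) and ignores the "variations thereof", and the minimum repeat number and the size of the end window are assumed parameters.

```python
import re


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def telomere_like_ends(contigs, motif="TTTAGA", min_repeats=3, window=500):
    """Return (contig_name, end) pairs with tandem motif repeats near an end.

    contigs: dict mapping contig name to sequence.
    min_repeats and window are illustrative assumptions.
    """
    patterns = [
        re.compile(f"(?:{m}){{{min_repeats},}}")
        for m in (motif, reverse_complement(motif))
    ]
    hits = []
    for name, seq in contigs.items():
        seq = seq.upper()
        ends = {"left": seq[:window], "right": seq[-window:]}
        for end, region in ends.items():
            if any(p.search(region) for p in patterns):
                hits.append((name, end))
    return hits
```

Counting the reported ends per genome and dividing by two then gives the number of chromosomes implied by the telomere count.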

Figure S2.2: Pulsed-field gel separation of Pp PN500 chromosomes, performed as described in Cox et al. (1990). The pulse conditions were 1245 sec pulse times for 90 hrs followed by 2000 sec pulse times for 72 hrs at 4 V/cm. Gel image kindly provided by E.C. Cox.

Eukaryote telomere maintenance requires a functional TERT protein. In the original DD annotation of the proteome no such protein was found. The comparative analysis revealed that such a protein is present in all social amoebae, but the domain responsible for telomere lengthening is interrupted by a polyN stretch in DD (Figure S2.3). This may render the domain non-functional and could have necessitated a novel mode of telomere end formation in DD. Alternatively, the modified telomere may have come first, allowing TERT gene function to erode.

Figure S2.3: TERT domain alignments. The domain positions were extracted using IPRscan; the alignment was done with ClustalX and corrected manually. Asterisks denote identical amino acids and colons similar amino acids.

TERT_PP  NVASVVENKKDMREIGRQEVLEVFDELICEHILKYGEDYYLQTKGIAQGA
TERT_DD  QSETIIYDSKYQQIIEKNQLLHLFHSHIFNHYIRYSSKLFHQVVGIGQGS
TERT_DF  SKNIKIFSINQGKIISRHQVLRSLKEFIFRTIVKYGNQYFFQVQGIAQGC
         .    : . :  : * ::::*. :.. * .  ::*... : *. **.**.

TERT_PP  KSSTKLCELELGDLDNQLLMPRLLDNNPN---------------------
TERT_DD  LCSPHLCSLLLGDLENEILLP-LLNNNKNNNNNNNKNNNNKNNNNNNNNN
TERT_DF  IISPLLCNIILGHLDTNFLLPNIMANDTTPN-------------------
           *. **.: **.*:.::*:* :: *: .

TERT_PP  -------------------------------------------AFNALVR
TERT_DD  NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNLEFNFLVR
TERT_DF  -------------------------------------------TLNEMMR
                                                     :* ::*

TERT_PP  FVDDYIYFTDSPSNTERFIDVFTNET-PALYGVKSNNQKTKQFLHDFGKA
TERT_DD  FIDDYLYVSTEIDNLKSFKNLFHNG--ITEYGVKANQSKTKYYYSDDENS
TERT_DF  YMDDYIFITTSYKNAIFFKYLLSSKESIDQYGVKTNPLKGNQFFPCNTEE
         ::***::.: . .*   *  :: .      ****:*  * : :     :

TERT_PP  KGIVELDG-----KYMPWCGLLINVESFEILYDYSKYSGKKLINEL-
TERT_DD  GLIKRGDGDDNDQLFIAWCGYLINCKTFEVQKDYSRLS-KKVNNDNF
TERT_DF  NKGLIQIG---YDSYIPWCGNLINVQSLAVLCDYGHNLNGNGGRKV-
                *      ::.*** *** ::: :  **.:    :  ..


2.3 Repetitive elements

We converted the whole assembled genomes of the species to BLASTable databases and searched for transposable elements (TEs) using 6 frame translations of the known DD retroelements. Furthermore, we scanned the genomes for the presence of reverse transcriptase domains to find even remotely similar TEs. This approach showed that repetitive elements contribute considerably to genome size only in DD (Table 1, main text). Previously defined DNA elements of DD have no common sequence structure which could be used for detection. Thus, to find these TEs we relied solely on sequence abundance. To this end we extracted all highly similar sequences (>95 % identity over at least 200 base pairs). We then removed sequences resulting from recent gene family expansions like genes. We found no further long sequences indicative of TEs. We therefore conclude that DNA TEs are either absent or present only once per kind in each genome.
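The abundance-based filter described above can be sketched as follows. The sketch assumes, as an illustration only, that the genome was compared against itself with BLAST and that the hits are available in the standard 12-column tab-separated format (query, subject, percent identity, alignment length, mismatches, gap openings, query start/end, subject start/end, e-value, bit score); the function name is hypothetical.

```python
def repeat_candidates(blast_lines, min_identity=95.0, min_length=200):
    """Yield (query, subject, identity, length) for highly similar hits.

    Keeps hits with identity above min_identity over at least min_length
    aligned positions and skips the trivial match of a region to itself.
    """
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        identity, length = float(fields[2]), int(fields[3])
        qstart, qend, sstart, send = (int(x) for x in fields[6:10])
        same_region = query == subject and (qstart, qend) == (sstart, send)
        if identity > min_identity and length >= min_length and not same_region:
            yield query, subject, identity, length
```

Hits that survive this filter would still need to be screened for recent gene family expansions, as described in the text, before being considered TE candidates.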

3. Coding sequences

3.1 Homopolymer tracts

DD proteins very often contain long homopolymeric tracts of either polyasparagine (polyN) or polyglutamine (polyQ), which thus far have not been shown to interfere with protein function. This is rather remarkable since polyglutamine tracts in metazoan proteins cause protein aggregation and a broad range of disorders (Hughes et al. 2001; Bauer and Nukina 2009). We investigated whether polyN and polyQ tracts were also common features of PP and DF proteins. Figure S3.1 shows that the number and size of polyN tracts are much lower in PP and DF than in DD, while polyQ tracts are only slightly reduced. Tracts mainly consist of A/T rich codons, which for both N and Q are considerably enriched in the three genomes. The abundance of polyN tracts in DD results in a 1.5-fold enrichment in N compared to DF and PP, but a 5-fold enrichment of N’s most A/T rich codon AAT (Table S3.1). Since 11 % of the DD amino acids are asparagines, the polyN tracts contribute considerably to the overall A/T richness of this genome.


Figure S3.1: Homopolymer tracts in the three proteomes. X-axis: length of homopolymer tract, Y-axis : number of occurrences. The respective tracts were extracted from the protein sequences and counted using a perl based custom made script. N stands for asparagine, Q for glutamine.
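The counting step behind this figure can be illustrated with a short Python sketch. It is a stand-in for the custom Perl script, which is not reproduced here; the minimum tract length is an assumed parameter, since the threshold used for the figure is not stated.

```python
import re
from collections import Counter


def tract_length_histogram(proteins, residue, min_length=10):
    """Count homopolymer tracts of one residue by tract length.

    proteins: dict mapping protein ID to amino acid sequence.
    min_length: shortest run counted as a tract (illustrative assumption).
    """
    pattern = re.compile(f"{re.escape(residue)}{{{min_length},}}")
    counts = Counter()
    for seq in proteins.values():
        for match in pattern.finditer(seq.upper()):
            counts[len(match.group())] += 1
    return counts
```

Calling the function once with "N" and once with "Q" for each proteome yields the length distributions that are plotted against each other.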

                 DD      DF      PP
Glutamine (Q)    5.03    5.23    4.91
  CAA            4.83    4.71    3.84
  CAG            0.19    0.52    1.07
Asparagine (N)   11.31   7.17    7.64
  AAU            10.13   4.98    4.99
  AAC            1.17    2.19    2.65

Table S3.1: percent codon preference for glutamine and asparagines. The first row indicates total percentage of occurrence while the second and third row for each amino acid indicate the codon used.

3.2 Gene families and orthology relationships

We used OrthoMCL (Li et al. 2003) to cluster all proteins of the three social amoebae genomes into families. This approach groups the proteins connected by significant BLAST hits into families. This way, even remotely related proteins can be clustered together if some part of their sequence is similar to a family member. However, faint similarities comprising only a small part of the proteins will go undetected. Most proteins form triplets (peak at 3 in Figure S3.2) and 94% of these 3 member gene families contain all three species (4945 gene families). This set makes up the shared core of single genes which are present only once in each genome with a clear orthology relationship between the three species. Some families are present in all three dictyostelids but amplified in one or the other species. The protein numbers in the different species of the ten largest families are listed in Table S3.2. Nearly 1200 and 1500 gene families are restricted to one or two species, respectively, according to this search (Table S3.3). Manual examination of the ten largest protein families with only two species revealed that 8 out of 10 are indeed clearly not present in at least one species. For the remaining two families, we did find members in the missing species, but they shared only low similarity with those in the other two species. Interestingly, nine of the ten families comprise only hypothetical proteins; the remaining one consists of DIRS derived sequences. Thus, the OrthoMCL approach appears to fail in grouping some protein families with very low similarity. We conclude that sequence erosion due to long evolutionary separation renders ortholog detection and even gene family definition impossible if only weak functional constraints restrict sequence evolution. Thus, the OrthoMCL analysis yields a somewhat blurred picture of gene family organization in dictyostelids.

Orthology relationship detection is another way to define and group gene sets. With this approach we define the closest relative in another genome, irrespective of its membership in a larger gene family. Orthologous proteins were detected using reciprocal best BLAST hits (BBHs). The number of detected orthologous pairs depends directly on the identity threshold used (Figure S3.3). Below a certain threshold the BLAST hits become spurious and are not reliable for ortholog detection. The example alignments at low identity levels in Figure S3.4 show that, compared to a simple BLAST output, the alignments can be improved using a dedicated alignment program. Based on random sampling we estimate that the lowest BLAST identity threshold usable for ortholog detection is around 30 %. The Venn diagram of Figure S3.5 shows that slightly more pairs exist between DF and PP than between each of these species and DD, emphasizing the closer relationship of these two species. This approach yields a slightly higher number of orthologous triplets than the OrthoMCL analysis, as can be seen from the common protein core in Figure S3.5.

We next compared the OrthoMCL and BBH results, including information on the domain content of proteins. Of the nearly 38,000 proteins from all species, only 5,604 neither belong to an OrthoMCL family, nor have BBHs in another species, nor contain a functional domain (Figure S3.6). On the other hand, 19,081 proteins were defined as orthologous core and at the same time assigned to gene families with the OrthoMCL approach. All orthologous core set proteins form families by definition, but 1,020 of them could not be clustered by OrthoMCL, indicating either 340 false positive core set triplets or constraints in detecting families with low similarities. Taken together, we estimate that the common, not eroded gene set for all social amoebae comprises around 6,000 genes.


Figure S3.2: Family size distribution, when using all (nearly 38,000) proteins of the three social amoebae. Most abundant are families with three members, 94 % of which have exactly one member in each species. For better readability the axis scales are logarithmic.

family   DD    PP    DF    total
1        0     203   7     210
2        1     182   1     184
3        3     14    158   175
4        3     2     155   160
5        2     6     126   134
6        70    51    13    134
7        115   0     0     115
8        1     990         100
9        83    0     0     83
10       46    15    19    80

Table S3.2: The number of proteins in the 10 largest gene families detected using orthoMCL

taxa    Gene families
1       1181
2       1485
3       5446
Total   8112

Table S3.3: Number of gene families that contain one, two or three taxa


Figure S3.3: Dependency of orthology detection on the identity threshold used. Each protein set from one species was compared to the other species using bidirectional BLAST. For each species, the number of proteins with at least one counterpart in the other genomes was counted as a function of the identity threshold applied. X-axis: threshold, Y-axis: number of orthologs.

Global amino acid alignment length: 283
Identical amino acids: 61 of 207 (29.47 %)
DDB0237959 (putative peroxisomal membrane protein) versus DFA_01754

1
DDB0237959  MVDSCCN...... TNDTINAV...... FHHINKHKKGDLKECLLSAIRGFRNGVLTGVRIRIPYIFQAVIYAVLAGGEEKSIGRV....KFVI
DFA_01754   MSDQQQDNSQQQPQPTDDTNNQVEEIKLDDSQQDKNRLCKHR....GSCIHSTLYTWARGLLIGYGLRASMALLSAIF...... IRRLYKNPRKLV

101
DDB0237959  KQMFYH...... GKNLGMFVGIYKSICCILRNIGIK.GGIDSLVAGFIGGYYAFGESKSVSGSVNNQIVLYLFARALIGIIQGMVKRKIIPQSLSTTTPK
DFA_01754   NQTLLHKDPIGFGLFLAFYTGGFKGVNCLLRAIRQKEDGYNSAIAGFVAG...... ASMMFSKSTEVALYLFARALESLFNAAWKRGYVKSWKHGDT..

201-283
DDB0237959  GFRIFAAVTLALILYLTEYEPENLNASFMGTMTVLYHKS.DSGPMIEKGDHKFGIPMIIIILSLFGGIFPKLSLDSMFFYFKL
DFA_01754   ALFCFCT.SVMFYAFVWEPNTVRPSYLKFLSKVAGKDRDLSQVTSK...... IRELYYISNGIKPTAA------

Global amino acid alignment length: 294
Identical amino acids: 80 of 206 (38.83 %)
DDB0190543 (UBX domain-containing protein) versus DFA_09856

1
DDB0190543  MNIGALLWVLFLVACSFTILYEYFAPRLEKKKWDKKKFIKDNFEMNKEKSDIVDKKQKKHNSLAKENEIKRREKLLNDLLNKLVFD...... KNKETENN
DFA_09856   ------MSKERIVEVDDKQVDHVKLAHTKMVDDE....SDRVNKLIESYEKLQGKSQQEKNN

101
DDB0190543  RK...LGKRLIDDNENKDEMNNNNYLSETERIIKEQDIEYYKSLETDQLLKLLKEKDINDKKEEQEKLKKQKQERLQFLKLNLKPEP...... PIDNE
DFA_09856   RKYIYTEGRTISGDEHLQE.NMERWKSDREILREMQDMEYEESLAKDKKLQQRKQEIIIDKEKEVQ....ERQNRIQWLKDNLRPEPIAAAAAADMVVGE

201
DDB0190543  NSIK...... LLIKLPNGENIQRRFLKTDTINDIYDFIDSRDQISF.KYSLATNYPKKVYKNDENIKLKSTLEELNITNLATFYLIEF
DFA_09856   NGSSSSFSSASSNISTIQIKLPSGATLKRRYLLSDTIQDIIDFVDSKEVVQKPRYYLATNIPKQQFR.DTTV....TIQDAQLYPQVSVYVIEE

Figure S3.4: Alignments using the needle program (Needleman and Wunsch 1970) at different thresholds. Identity values are increased by around 7 % compared to the original BLAST-based BBH analysis. Amber coloring depicts different amino acids.

Figure S3.5: Venn diagram of shared orthologous proteins. Each circle of the Venn diagram represents one species. Numbers in overlapping areas show common proteins between two or three species. Since differential gene family expansions occurred between the genomes a clear orthology relationship cannot always be established. Thus, each comparison is represented by the outcome of two-way comparisons and the colored numbers refer to the respective reference genome from which the numbers were derived (black: DD; red: DF; green: PP)


Figure S3.6: Venn diagram of all proteins showing overlaps between gene family definition by orthoMCL, core genes from a three way BBH, and Interpro domain containing proteins. The circle in the upper right corner depicts proteins with none of these features.

3.3 Synteny analysis

We used the orthology relationships obtained by the BBH approach (see supplemental methods) to define syntenic segments between the genomes. The smallest possible syntenic segment is two orthologous gene pairs separated by not more than 5 other genes in both genomes, irrespective of the orientation of the orthologs. The degree of synteny between genomes can be viewed as a measure of the stability of gene order over time. Our analysis shows that gene order conservation is rare in the social amoebae, with only around 2000 CDS neighbors being conserved between DD and one of the other genomes (Table S3.4). The closer relationship of DF and PP is reflected by a slightly higher number of syntenic CDS, suggesting that synteny breakup occurs in a time-dependent manner in social amoebae. Most synteny groups are small, the majority consisting of only 2 members. Only 1079 syntenic neighbours are conserved between all species. Thus, less than 10 % of all protein coding genes retained at least one neighboring gene in all species, indicating a frequent reshuffling of the genomes. Syntenic regions might therefore have been retained only by chance and are in general not subjected to functional constraints. As was shown for Drosophila species (Bhutkar et al. 2008), synteny breaks are also a measure of divergence time. If we take the estimated break frequency of this lineage (0.1 per Mb and mya) and apply this measure to the social amoebae, we would find a lineage split at 1200 mya for all social amoebae and 1000 mya for the split between PP and DF. This is nearly exactly two times the value we found by dating the split using a concatenated data set of 33 proteins. This indicates that synteny breakup occurs more frequently in the social amoebae than in the Drosophila lineage, or that the split between DD, PP and DF occurred earlier than predicted from protein sequence divergence.
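The neighbourhood criterion can be sketched as follows. The sketch makes two assumptions that are not spelled out above: gene positions are given as a contig name plus the rank of the gene along that contig, and larger synteny groups are built by chaining every two ortholog pairs that satisfy the criterion. All names are illustrative, and the all-against-all comparison is kept for clarity rather than speed.

```python
from itertools import combinations


def syntenic_groups(orthologs, pos_a, pos_b, max_intervening=5):
    """Group ortholog pairs that are neighbours in both genomes.

    orthologs: list of (gene_a, gene_b) pairs.
    pos_a, pos_b: dicts mapping gene -> (contig, rank along contig).
    Orientation is ignored, as in the definition given in the text.
    """
    def close(p, q):
        return p[0] == q[0] and abs(p[1] - q[1]) - 1 <= max_intervening

    parent = list(range(len(orthologs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(orthologs)), 2):
        (a1, b1), (a2, b2) = orthologs[i], orthologs[j]
        if close(pos_a[a1], pos_a[a2]) and close(pos_b[b1], pos_b[b2]):
            parent[find(i)] = find(j)

    groups = {}
    for i, pair in enumerate(orthologs):
        groups.setdefault(find(i), []).append(pair)
    return [group for group in groups.values() if len(group) >= 2]
```

Tabulating the sizes of the returned groups gives a distribution of the kind shown in Table S3.4.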
Table S3.4: Synteny groups in the three species

synteny group members                     DD/DF   PP/DD   PP/DF
13                                        0       1       0
12                                        0       0       0
11                                        0       0       0
10                                        0       0       0
9                                         0       2       0
8                                         0       0       1
7                                         1       1       3
6                                         5       4       6
5                                         12      17      29
4                                         43      43      73
3                                         133     155     219
2                                         626     716     942
Total                                     820     939     1273
total number of genes in synteny groups   1920    2216    3043
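The two summary rows of Table S3.4 follow directly from the per-size counts: the number of groups is the sum of the counts, and the number of genes is the sum of group size times count. A minimal sketch of this bookkeeping, with the non-zero counts transcribed from the table (variable and function names are illustrative):

```python
# Number of synteny groups per group size, transcribed from Table S3.4.
groups_by_size = {
    "DD/DF": {7: 1, 6: 5, 5: 12, 4: 43, 3: 133, 2: 626},
    "PP/DD": {13: 1, 9: 2, 7: 1, 6: 4, 5: 17, 4: 43, 3: 155, 2: 716},
    "PP/DF": {8: 1, 7: 3, 6: 6, 5: 29, 4: 73, 3: 219, 2: 942},
}


def summarize(counts):
    """Return (number of groups, number of genes in groups)."""
    n_groups = sum(counts.values())
    n_genes = sum(size * n for size, n in counts.items())
    return n_groups, n_genes


summary = {pair: summarize(counts) for pair, counts in groups_by_size.items()}
for pair, (n_groups, n_genes) in summary.items():
    print(pair, n_groups, n_genes)
```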

4. Protein functional domains

A comparison of the total inventory of protein functional domains between the genomes provides information on their relative potential for phenotypic complexity. In addition, comparative analysis of functional domains can aid in establishing orthology relationships between conserved proteins. As described in section 3.2, accumulation of mutations blurs sequence similarities over long evolutionary time periods, thus rendering orthologous pair detection difficult. Functional domains are under purifying selection to retain functionality, and their sequences are therefore better suited for identification of orthologous genes in other genomes. We used InterproScan (www.ebi.ac.uk/Tools/InterProScan/) to identify all functional domains in the three genomes (Table S4.1). There are no large differences between the genomes in terms of number of domains per genome. The number of domains that are only present in one or two genomes is very small (in total 1 – 1.2% of all domains), which means that 99 % of all domains are conserved between species. Some of the dissimilarities between species could be explained by the fact that different Interpro domain definitions often overlap, so that assignment to one or the other definition can depend on only slight differences in the underlying sequence. Other species-specific domain allocations could be false positive predictions. Nevertheless, we found also clear evidence that a handful of domains encode species-specific functions (see main text). Table S4.2 lists the five most abundant domain accessions absent in at least one other species as defined by the InterProScan results.

Table S4.1: Proteins with Interpro domains

                                                          DD      PP      DF
Number of domains                                         23569   22613   21687
Number of proteins with domains                           8409    8087    7390
Number of proteins without domains                        5024    4286    4783
Number of all proteins                                    13433   12373   12173
% of proteins with domains                                62.60   65.36   60.71
Number of specific domains unique to each genome          187     149     118
Number of specific domains absent in 1 out of 3 genomes   139     134     132
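The last two rows of Table S4.1 are presence/absence comparisons between the three domain inventories. One way to derive such counts is sketched below, assuming that each genome is represented by the set of InterPro accessions found in it; the function name and the toy accessions in the usage example are hypothetical.

```python
def domain_presence_summary(domains_by_genome):
    """Classify domain accessions by presence across genomes.

    domains_by_genome: dict mapping genome name to a set of accessions.
    Returns, for each genome, the accessions found only there ("unique")
    and the accessions found in all other genomes but not there
    ("absent_only_here").
    """
    summary = {}
    for genome, own in domains_by_genome.items():
        others = [s for g, s in domains_by_genome.items() if g != genome]
        summary[genome] = {
            "unique": own - set().union(*others),
            "absent_only_here": set.intersection(*others) - own,
        }
    return summary


toy = {
    "DD": {"A", "B", "C", "D"},
    "PP": {"A", "B", "E"},
    "DF": {"A", "C", "E", "F"},
}
result = domain_presence_summary(toy)
```

Overlapping InterPro definitions, as noted in the text, would inflate both categories, so counts obtained this way are upper bounds on truly species-specific domains.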

Table S4.2 A: Three most enriched molecular function GO-terms for each species that are missing in at least one other species. Two GO-terms are tied for 3rd in PP.

             DD   PP   DF   Interpro domain description
GO:0004965   6    4    0    GABA-B receptor activity
GO:0003830   4    0    3    beta-1,4-mannosylglycoprotein 4-beta-N-acetylglucosaminyltransferase activity
GO:0004351   2    0    0    glutamate decarboxylase activity
GO:0008061   0    7    0    chitin binding
GO:0004568   0    7    5    chitinase activity
GO:0003840   0    6    1    gamma-glutamyltransferase activity
GO:0030414   1    6    0    peptidase inhibitor activity
GO:0004568   0    7    5    chitinase activity
GO:0003830   4    0    3    beta-1,4-mannosylglycoprotein 4-beta-N-acetylglucosaminyltransferase activity
GO:0004026   0    1    2    alcohol O-acetyltransferase activity

Table S4.2 B: Five most enriched molecular function GO-terms for each species relative to the others that are not necessarily species-specific

             DD    PP   DF   Interpro domain description
GO:0003964   107   25   14   RNA-directed DNA polymerase activity
GO:0016209   15    4    5    antioxidant activity
GO:0016740   86    42   39   transferase activity
GO:0000036   49    21   24   acyl carrier activity
GO:0015385   6     2    2    sodium:hydrogen antiporter activity
GO:0016829   6     19   2    lyase activity
GO:0008061   0     7    0    chitin binding
GO:0030414   1     6    0    peptidase inhibitor activity
GO:0003840   0     6    1    gamma-glutamyltransferase activity
GO:0004803   1     5    0    transposase activity
GO:0004659   6     3    13   prenyltransferase activity
GO:0004838   0     0    1    L-tyrosine:2-oxoglutarate aminotransferase activity
GO:0003721   0     0    1    telomeric template RNA reverse transcriptase activity
GO:0005179   0     0    1    hormone activity ()
GO:0008413   0     0    1    8-oxo-7,8-dihydroguanosine triphosphate pyrophosphatase activity

5. Protein families

5.1 Primary metabolism

We have carried out a comparative analysis of biochemical pathways based on the KEGG database (Kyoto Encyclopaedia of Genes and Genomes; http://kegg.jp/), focusing on the carbohydrate, lipid, nucleotide and amino acid metabolism of the three dictyostelids. As motility, an energy consuming process, is highly important for their life style, we have included the citric acid cycle and oxidative phosphorylation in the analysis. The human and yeast pathways serve as a comparison (Table S5.1). Signalling pathways are not included in our analysis. Our analysis shows that all genes encoding the essential enzymes of the individual pathways are present and that, with few exceptions, a similar set of enzymes exists for all pathways (Table S5.1 and S5.2). Some pathways are remarkably different in the dictyostelids. In the sugar metabolism pathway, lactate dehydrogenase (LDH) is absent, as is pyruvate carboxylase. The absence of LDH may be a consequence of the life style, which is essentially aerobic. Further absent enzymes are glucose-6-phosphatase and hexokinase. In general, the metabolism seems not to be centred on carbohydrates. Notable is also the absence of a urea cycle. In DD, transporters have been identified that directly dispose of ammonia formed during metabolic processes (Kirsten et al. 2008). Comparisons of individual pathways within the dictyostelids show that the individual components are well conserved, although the degree of conservation differs. We can distinguish between well conserved pathways such as gluconeogenesis and the pentose phosphate pathway, and highly conserved ones like glycolysis, citrate cycle, pyruvate and lipid metabolism. A very high degree of conservation is found in nucleotide and phenylalanine metabolism. Enzymes of energy metabolism show a varying degree of conservation among the dictyostelids. All components of the respiratory chain and the ATPase complex are present and can be identified based on homology. The components of complex II (succinate dehydrogenase) and the ATPase subunits are particularly well conserved. Similar to yeast, creatine kinase and arginine kinase are missing in dictyostelids, as these organisms appear to use neither creatine phosphate nor arginine phosphate to quickly regenerate ATP. Despite the general conservation, there are species-specific genes, primarily in energy metabolism, where we find a wider spectrum of components of complex I, II and IV, and in nucleotide metabolism, which shows some peculiarities. Overall, their number is limited (Table S5.2). In most cases, the gene sets for a particular process in dictyostelids are less complex than in the human and yeast genomes. The highest complexity (i.e. a high number of isoenzymes) is consistently observed in man, with the exception of phenylalanine metabolism, which stands out in yeast with 23 associated genes versus 5 in man and 4 in the dictyostelids. The high complexity in man is particularly evident for glycolysis, with 60 associated genes versus 44 in yeast and only 26 in the dictyostelids, and for purine metabolism, with 155 genes in man and 90 in yeast as compared to 71 for the dictyostelids. In the latter case, however, there is a set of unique dictyostelid genes that are related to cAMP metabolism and that have only been added in brackets to the primary list (Table S5.1).

Table S5.1: Number of genes involved in primary metabolism (HS=Homo sapiens; SC=Saccharomyces cerevisiae)

Process or pathway         Number of genes      HS    SC    DD        PP        DF        common   Dictyostelid
                           analysed in pathway                                            genes    specific genes
Glycolysis                 88                   60    44    26        24        24        9        1
Glycogen metabolism        20                   13    8     8         7         7         4        2
Pentose phosphate pathway  57                   26    28    20        19        19        15       2
Amino sugar metabolism     105                  44    28    23        21        19        13       3
Pyruvate metabolism        80                   40    32    22        22        22        14       1
Fatty acid synthesis*      49                   29    14    15        15        15        3        (none)*
  (61)* (27)*
Fatty acid metabolism*     61                   33    12    11        10        11        6        (none)*
  (23)*
Steroid metabolism         29                   17    15    10        9         9         6        2
Purine metabolism¹         259 (274)¹           155   90    71 (86)¹  71 (86)¹  71 (86)¹  53       4
Pyrimidine metabolism      165                  97    67    63        63        62        47       4
Arginine metabolism        127                  66    34    22        22        22        9        3
Phenylalanine metabolism   41                   5     23    4         4         4         2        none
Citric acid cycle          50                   32    35    31        31        31        22       1
Oxidative phosphorylation
  Complex I                81                   44    4     30        30        30        1        5
  Complex II               21                   5     9     7         7         7         4        3
  Complex III              20                   15    13    8         7         8         5        none
  Complex IV               43                   25    13    14        14        14        4        4
  ATPase                   90                   51    35    24        23        22        18       1

* includes enzymes for polyketide metabolism
¹ includes enzymes involved in cAMP signaling

Table S5.2: Dictyostelid specific genes involved in the primary metabolism

Process or pathway          Dictyostelid specific enzymes                             Gene ID (DD only)
Glycolysis                  Glucokinase                                               DDB_0218308
Glycogen metabolism         alpha glucosidase (gaa)                                   DDB0237578
                            alpha glucosidase II (modA) (DD only)                     DDB0191113
Pentose phosphate pathway   gluconate 2-dehydrogenase                                 DDB_0231445
                            additional glucose-6-phosphate 1-dehydrogenase (g6pd-2)   DDB_0238739
Amino sugar metabolism      Fructokinase                                              DDB0204621
                            galactose-1-phosphate uridylyltransferase                 DDB0204814
                            additional beta-hexosaminidase nagA                       DDB0191256
Pyruvate metabolism         acetyl-CoA synthetase                                     DDB_0233947
Steroid metabolism          cycloartenol synthase (CAS1)                              DDB_0191311
                            delta14-sterol reductase                                  DDB_0232079
Purine metabolism           allantoinase (allB1)                                      DDB_0231352
                            urate oxidase                                             DDB_0231470
                            purine nucleosidase (iunH)                                DDB_0231227
                            DNA polymerase I (polA)                                   DDB_0191131
Pyrimidine metabolism       thymidylate synthase (FAD) (thyA)                         DDB_0214905
                            pseudouridylate synthase (pus3)                           DDB_0231223
                            dCTP deaminase                                            DDB_0230112
                            dCTP deaminase                                            DDB_0230132
Arginine metabolism         ornithine cyclodeaminase                                  DDB0305175
                            proline iminopeptidase                                    DDB_G0279793
                            acetylornithine deacetylase                               DDB_G0267380
Citric acid cycle           ATP citrate (pro-S)-lyase                                 DDB_0235361
Oxidative phosphorylation
Complex I                   NADH dehydrogenase (ubiquinone) 1 alpha subcomplex 12     DDB_G0285783
                              ndufa12
                            putative NADH dehydrogenase                               DDB0238855
                            calcium-binding EF-hand domain-containing protein,        DDB0238858
                              putative NADH dehydrogenase
                            putative NADH dehydrogenase (ubiquinone), putative        DDB0233209
                              NADH-ubiquinone 13 kDa subunit
                            putative NADH dehydrogenase (ubiquinone), putative        DDB0233223
                              NADH-ubiquinone oxidoreductase 8B subunit,
                              mitochondrial ribosome domain-containing protein
                            Dictyostelium discoideum mitochondrial DNA encodes a      DDB0201590
                              NADH:ubiquinone oxidoreductase subunit which is
                              nuclear encoded in other eukaryotes
Complex II                  sdhaf1 = Succinate DeHydrogenase Assembly Factor 1A       DDB0305170
                            sdhaf1 = Succinate DeHydrogenase Assembly Factor 1B       DDB0305171
                            Succinate dehydrogenase subunit 5, mitochondrial          Q54B20
Complex IV                  cytochrome c oxidase subunit I (COX1)                     DDB0201587
                            cytochrome c oxidase subunit V                            DDB0191104
                            cytochrome c oxidase subunit VIIe                         DDB0216179
                            cytochrome oxidase subunit VIIs                           DDB0216180
ATPase                      vatM                                                      DDB_0216215

5.2 Secondary metabolism

Polyketide synthases (PKS) are a family of modular enzymes that play a key role in the production of polyketides, a large set of secondary metabolites in bacteria, plants or fungi. Polyketides have many established roles as antibiotics, fungicides and in defense against predation. In DD, the PKS steely 2 synthesizes the precursor of the Differentiation Inducing Factor, DIF, which triggers the differentiation of basal disk cells (Austin et al. 2006; Saito et al. 2008). Many other PKS enzymes are present in DD, and several other polyketides with developmental roles have been fully or partially identified (Saito et al. 2006; Serafimidis and Kay 2005), while additional roles for PKSs in defense, pigment production, and attraction of food organisms are suspected. We searched the genomes of the dictyostelids for PKSs using Blast based approaches and domain search using InterProScan. DD contains by far the most PKS genes, indicating a species-specific gene family expansion (Table S5.3). Detection of PKS domains is not straightforward and needs a detailed examination (John et al. 2008; Zucko et al. 2007). Using SearchPKS (http://www.nii.res.in/searchpks.html) we identified the domain architectures of PKS proteins (Table S5.4). The ketide synthase (KS) domains of all identified proteins were aligned and the alignment was used to infer phylogenetic relationships between the proteins (Figure 3, main text). The phylogeny shows that at most four PKSs are fully conserved between DD, PP and DF and that individual species show extensive gene family expansion from single ancestral genes. Examination of the domain architecture of the affected proteins reveals that the expansions occur mainly in two subsets of PKSs that have a specific modular architecture (Table S5.4). The PKS proteins with a full set of domains are prevalent in all species, but DD also has a large gene family in which the ER domain is missing. All other PKS proteins have unique domain structures. Remarkably, domain structures of PKSs can differ even if the position of the KS domain in the phylogenetic tree indicates a close relationship. For example, in PP the phylogeny of the KS and AT domains indicates a close relationship of PPL_12615 and PPL_12638, while the domain structure of PPL_12615 is the same as in a distantly related PKS (PPL_02171). This suggests that domain structures could be reorganized via homologous recombination between PKS genes or by introduction or deletion of domains.

Table S5.3: Abundance of PKS genes in the three genomes

| PKS | DD | PP | DF |
|---|---|---|---|
| Steely type PKS | 2 | 2 | 2 |
| Other PKS | 40 | 14 | 14 |

Table S5.4: Polyketide synthases, domain architecture. Locus tags are coloured black, red and green for DD, DF and PP genes respectively.

DDB0217561, DDB0217614; DFA_11318; DFA_01351; PPL_09118

DDB0219599; DDB0216993

DDB021687; DDB0219595

DDB0235267; DDB0230082; DDB0235176; DDB0235179; DDB0230071; DDB0235183; DDB0235205; DDB0235174; DDB0230076; DDB0230074; DDB0235262; DDB0235263; DDB0235264; DDB0235258; DDB0235230; DDB0235303; DDB0237714; DFA_09704; DFA_11312; DFA_11170; DFA_07699; DFA_12319; DFA_07702; DFA_02351; DFA_09772; DFA_06797; PPL_04464; PPL_04485; PPL_10967; PPL_04466; PPL_00459; PPL_02946

DDB0235219

DDB0202700


DDB0235175

DDB0216999; DDB0230081; DDB0230080; DDB0235218; DDB0235216; DDB0237651; DDB0237650; DDB0235304; DDB0235302; DDB0235301; DDB0235300; DDB0235162; DDB0235169; DDB0235164; DDB0231390; DFA_11310; PPL_04217

DDB0230068; PPL_08853

DDB0230079

DDB0235220

DDB0235222

DDB0184430

DDB0230077; DDB0230078

DDB0234164; DFA_08550; PPL_02802; stlA

DDB0234163; stlB

DFA_09640; stlB

PPL_00289; stlB

DFA_07700; DFA_02353

PPL_12638

PPL_12615; PPL_02171

PPL_04374

PPL_04222; PPL_10494

5.3 Cell shape, organization and motility

5.3.1 Kinesins

Kinesins are motor proteins that move along microtubules. They are involved in mitosis, meiosis and cargo transport. The kinesins of DD were comprehensively described by Kollmar and Glöckner (2003). The search for kinesins in PP and DF showed that kinesins are almost entirely conserved in the dictyostelids. Only one kinesin (kif14 in Figure S5.1) appears to have been lost in DD. All other kinesins have a one-to-one orthology relationship to kinesins in the other species (Figure S5.1) and have the same domain architecture (Table S5.5).


Figure S5.1: Phylogenetic tree of kinesin domains Proteins were identified by scanning the genomes for kinesin domains. Domains were extracted from the proteins and then aligned. The tree was constructed using the phylip package (1000 replications bootstrapped neighbor joining trees using default values) and visualized using MEGA. The tree was rooted with a human kinesin domain.
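The bootstrap step mentioned in the legend can be sketched as follows. This is a minimal illustration and not the phylip workflow itself: the alignment is a hypothetical toy example, the function names are invented here, and the simple p-distance stands in for whatever distance model the package applies by default. Tree building and consensus steps are omitted.

```python
import random


def bootstrap_columns(alignment, rng):
    """Return one bootstrap replicate: alignment columns resampled with replacement."""
    names = list(alignment)
    n_cols = len(alignment[names[0]])
    if any(len(alignment[name]) != n_cols for name in names):
        raise ValueError("all aligned sequences must have the same length")
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(alignment[name][i] for i in picks) for name in names}


def p_distance(a, b):
    """Fraction of differing residues, counted over columns without gaps."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


# Hypothetical toy alignment (not real kinesin domains).
toy = {"seqA": "MKVAL-GT", "seqB": "MKIALQGT", "seqC": "MRIALQGS"}

# 1000 replicates, mirroring the number of replications named in the legend.
rng = random.Random(0)
replicates = [bootstrap_columns(toy, rng) for _ in range(1000)]
```

Each replicate would then be turned into a distance matrix and a neighbor joining tree, and the support for a branch is the share of replicate trees that contain it.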

Table S5.5: Length and domain architecture of kinesin genes in the three species. kif1, kif3, kif5 and kif7 are responsible for organelle transport; kif2, kif4, kif6 and kif11-14 are mitotic motors (see Kollmar and Glöckner 2003 for details). Domain positions are given as residue ranges.

| protein | feature | DD | DF | PP |
|---|---|---|---|---|
| kif 1 | length | 2205 | 1302 | 1948 |
| | kinesin | 8..356 | 2..399 | 2..368 |
| | SMAD/FHA | 454..519 | 467..578 | 462..571 |
| | Pleckstrin | 1524..1618 | | 1354..1449 |
| kif 2 | length | 792 | 814 | 721 |
| | kinesin | 443..782 | 465..812 | 368..710 |
| kif 3 | length | 1193 | 1172 | 1024 |
| | kinesin | 9..330 | 117..483 | 2..368 |
| kif 4 | length | 1922 | 2486 | ND |
| | kinesin | 28..344 | 89..492 | |
| | Prefoldin (1570..1673) | | | |
| kif 5 | length | 990 | 901 | 883 |
| | kinesin | 12..331 | 1..310 | 4..337 |
| kif 6 | length | 1030 | 869 | 1594 |
| | kinesin | 453..773 | 366..618 | 1020..1356 |
| | SAM | 4..64 | 2..67 | 662..741 |
| kif 7 | length | 1255 | 1085 | 992 |
| | kinesin | 34..350 | 58..384 | 94..422 |
| kif 8 | length | 1873 | 1522 | 1617 |
| | kinesin | 19..414 | 46..261 | 54..401 |
| | WD40 repeats | 1498..1842 | 1264..1522 | 1455..1617 |
| kif 9 | length | 1222 | 1100 | 960 |
| | kinesin | 356..720 | 359..633 | 318..632 |
| kif 10 | length | 1238 | 958 | 893 |
| | kinesin | 2..363 | 10..399 | 20..398 |
| kif 11 | length | 685 | 640 | 586 |
| | kinesin | 10..406 | 15..385 | 2..398 |
| kif 12 | length | 1499 | 1513 | 1401 |
| | kinesin | 132..571 | 145..582 | 108..504 |
| kif 13 | length | 1265 | 1290 | 1199 |
| | kinesin | 18..392 | 22..411 | 16..384 |
| kif 14 | length | ND | 1245 | 397 |
| | kinesin | | 7..348 | 20..331 |

5.3.2 Myosins

Myosins are relatively scarce in the DD genome (Kollmar 2006). Querying the DF and PP genomes with myosin sequences revealed that the myosins are as conserved as the kinesins. All myosins exhibit clear 1:1:1 orthology relationships between species (Figure S5.2; see also Table S5.6).


Figure S5.2: Phylogenetic tree of Myosin proteins. The total myosin protein sequences were used for the alignment. The tree was constructed using the phylip package (1000 replications bootstrapped neighbor joining trees using default values) and visualized using MEGA. DF, PP and DD myosins are in red, green and black, respectively.

5.3.3 The microfilament system

In general terms the repertoire of microfilament system components of DF and PP is almost a 1:1 replica of that of DD (Table S5.6). There are, however, a few notable exceptions, which are commented on below. For details on the structure and function of actin proteins in DD the reader is referred to reviews on the topic (Rivero and Eichinger 2005). We have identified 30 and 29 actin and actin-related genes in DF and PP, respectively (Figure S5.3), as well as three putative pseudogenes (recognizable by truncations and frame shifts) in each species. This contrasts with the 42 genes and 7 pseudogenes of the DD actinome (Joseph et al. 2008). All three actinomes encode comparable numbers of different proteins: 19 in DF, 16 in PP and 18 in DD. In DD a group of 17 genes encodes an identical protein, Act8. Six genes in PP and 4 in DF encode identical proteins that differ from DD Act8 in only 3 or 4 residues, respectively. A few more actins of PP and DF are also very closely related to DD Act8 and Act10. The remaining actin genes encode fairly divergent proteins without direct orthologs in the DD actinome. All three species have the same set of actin-related proteins, although with variable degrees of similarity. In all three species, two genes encode proteins that contain actin in combination with other domains. Filactin (fia) comprises a bona fide actin domain preceded by two filamin repeats. Filactins are restricted to Amoebozoa (unpublished) and are apparently absent from higher eukaryotes. The previously unnoticed DDB_G0281783 encodes a protein with a divergent actin domain preceded by an F-box and a prolyl-hydroxylase α subunit homology domain. The F-box domain recruits substrates for ubiquitination in cullin-dependent ubiquitin complexes (Skowyra et al. 1997). This protein is apparently restricted to Dictyostelids. Remarkably, many of the actin genes of DF and PP, including some of the divergent ones but excluding fia and the DDB_G0281783 orthologs, have a conserved intron position (albeit a different one in each species), whereas DD actin genes are intronless. From this we can confidently conclude that in DF and in PP all conventional actin genes are the result of duplications of a single ancestral gene.
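The difference between the number of actin genes and the number of distinct actin proteins can be made concrete with a short sketch. It assumes a mapping from gene name to predicted protein sequence; the gene names and sequences below are hypothetical toy values, and the function names are invented for this illustration.

```python
from collections import defaultdict


def group_identical(gene_to_protein):
    """Group genes by the exact protein sequence they encode, largest group first."""
    groups = defaultdict(list)
    for gene, protein in gene_to_protein.items():
        groups[protein].append(gene)
    return sorted(groups.values(), key=len, reverse=True)


def residue_differences(a, b):
    """Count differing residues between two sequences of equal (aligned) length."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical toy data: four genes, two distinct proteins.
toy = {
    "geneA": "MDGEDVQALV",
    "geneB": "MDGEDVQALV",
    "geneC": "MDGEDVQALV",
    "geneD": "MDGEEVQALI",
}
groups = group_identical(toy)

if __name__ == "__main__":
    print(len(toy), "genes encode", len(groups), "distinct proteins")
    print("largest identical group:", groups[0])
    print("differences geneA vs geneD:", residue_differences(toy["geneA"], toy["geneD"]))
```

Applied to a full actinome, the number of groups gives the count of different proteins, and the largest group corresponds to a set of genes that encode one identical protein.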

Most of the proteins involved in nucleation of actin filaments are discussed in the section on Rho signaling. DF encodes a second Scar/WASP-related protein, and both DF and PP lack an ortholog of the DD WH2-domain containing DDB_G0281723. PP has two genes encoding the G-actin sequestering protein actobindin, as compared to one in DF. In DD actobindins are represented by three genes, two of which encode the same protein. There are also some differences in the set of profilins. Compared to the 3 genes of DD, DF has 2 genes and PP has 6. Two of the PP profilin genes occupy tandem positions in the genome, and two more are only 5.5 kb apart.

The ADF/cofilin family also differs between all three species. A phylogenetic tree of all ADF domains encoded by the genomes of the three species distinguishes two major groups (Figure S5.4). The ADF domains present in cofilin, twinfilin and GMF (glia maturation factor) constitute one group. PP has two genes encoding cofilins, one of them related to DD cofilin 1. DF has only one cofilin 1-related gene. In DD the cofilin subfamily has undergone expansion, as it is represented by 7 genes (two of them as duplications on chromosome 2) and one pseudogene. The coactosin group comprises 8 genes in DF and PP and 9 genes plus 2 pseudogenes in DD. Their relationships are evident from the phylogenetic tree (Figure S5.4).

The gelsolin family includes a novel villin-related gene in DF and PP that has no ortholog in DD. It encodes a protein with a domain architecture similar to the product of vilC. By contrast, both PP and DF seem to lack the gene encoding protovillin (vilB). DF has an additional gene that encodes a short protein consisting of only two gelsolin repeats. The gelsolin repeat is also found in sec23 and sec24 (each encoded by a single gene in all three genomes), two components of the COPII coat involved in vesicle trafficking. It remains to be seen whether this repeat mediates an interaction of the sec23/24 complex with the actin cytoskeleton.

Only one of the two annexins of DD, annexin C1, is represented in DF and PP. Additionally, one more gene in DF encodes a slightly divergent annexin C1, whereas one more gene in PP encodes an annexin C1-related protein that lacks the characteristic GYPPQ repeats. Among the translation elongation factors with actin-binding properties, DF and PP each have a single gene encoding an EF1α (two in DD) and two genes encoding EF1β (one in DD) but no genes encoding the shorter EF1-related proteins (two in DD). Table S5.6 lists one DD locus, iliQ, among the genes encoding a protein with a villin head piece domain, although this domain is identifiable only in the PP ortholog.

A family of proteins that has undergone expansion in DF and PP is that of the comitins. Comitins are characterized by a bulb-type mannose-specific lectin domain found in plant lectins. Besides the classical comitin encoded by comA there are two more genes encoding comitin-related proteins in DD. We have found 6 genes encoding proteins of this family in PP and 5 in DF. None of them can be considered a direct ortholog of any of the DD genes. While DD comitins have only one lectin domain, some of the PP and DF comitins consist of a tandem of two lectin domains. In two proteins, PPL_01920 and PPL_11248, the lectin domain is followed by an SCP domain, a putative calcium chelating serine protease.

In the large family of calponin homology (CH) domain proteins (Friedberg and Rivero 2010) PP has an additional gene (PPL_07592) encoding a fimbrin that displays 68% similarity to the DD fimbrin, lower than that of the more characteristic PPL_04342 (85% similarity). DF has an additional gene, DFA_01046, that encodes a CH protein with no clear affiliation. The gxcZ gene is absent in both DF and PP. As discussed in the section about Rho signaling, this gene is probably the result of a duplication of gxcD that took place only in DD.

Two proteins containing Ig_FLMN domains, filamin (gelation factor, ABP120) and filactin, were known previously and are present in all three genomes. They harbor 6 and 2 Ig_FLMN domains, respectively. In the course of this analysis we have identified four proteins with a single Ig_FLMN domain each that are also present in all three species. One protein (DDB_G0283983/PPL_10139/DFA_11749) contains an N-terminal Ig_FLMN domain and a C-terminal HECT domain, which is a ubiquitin ligase domain. Proteins with the same domain architecture can be found in insects and chordates, but there is no information on their roles. DDB_G0289585 (PPL_07596/DFA_11699) consists of leucine-rich repeats and a C-terminal Ig_FLMN domain. DDB_G0268876 (PPL_04859 + PPL_04858/DFA_07612) contains an N-terminal Ig_FLMN domain and a C-terminal tyrosine kinase domain. DDB_G0289585 (PPL_08611/DFA_09865) has one Ig_FLMN domain as the only recognizable domain. The last three proteins are apparently unique to the Dictyostelids. Whether the single Ig_FLMN domain of these proteins is involved in any interaction with the actin cytoskeleton needs further investigation.

Finally, we have failed to identify representatives of two families of actin-binding proteins in DF and PP, namely hisactophilin and ponticulin. These are relatively short proteins and the possibility exists that their respective genes have escaped automated prediction.

Table S5.6: The microfilament system in Dictyosteliidae

| DF | PP | Proposed name | DD orthologous genes | Domain |
|---|---|---|---|---|
| Actins | | | | |
| DFA_05968 | | Actin | act8/act10 | ACT |
| DFA_08074 | | Actin | act8/act10 | ACT |
| DFA_11706 | | Actin | act8/act10 | ACT |
| DFA_11773 | | Actin | act8/act10 | ACT |
| DFA_11894 | | Actin | act8/act10 | ACT |
| DFA_11631 | | Actin | act8/act10 | ACT |
| DFA_06060 | | Actin | act8/act10 | ACT |
| DFA_08471 | | Actin | | ACT |
| DFA_06454 | | Actin | | ACT |
| DFA_02577 + DFA_02578 | | Actin | | ACT |
| DFA_06455 | | Actin | | ACT |
| DFA_06456 | | Actin | | ACT |
| DFA_09957 | | Actin | | ACT |
| DFA_02204 | | Actin | | ACT |
| DFA_03078 | | Actin | | ACT |
| DFA_04728 | | Actin | | ACT |
| DFA_11064 | | Actin | | ACT |
| DFA_00544 | | Actin | | ACT |
| DFA_07273 | | Actin | | ACT |
| DFA_09958 | | Actin | | ACT |
| DFA_08517 (ps) | | | | ACT |
| DFA_06453 + DFA_06454 (ps) | | | | ACT |
| DFA_08342 + DFA_08343 (ps) | | | | ACT |
| | PPL_02635 | Actin | act8/act10 | ACT |
| | PPL_02661 | Actin | act8/act10 | ACT |
| | PPL_05120 | Actin | act8/act10 | ACT |
| | PPL_05444 | Actin | act8/act10 | ACT |
| | PPL_12579 | Actin | act8/act10 | ACT |
| | PPL_12292 | Actin | act8/act10 | ACT |
| | PPL_11090 | Actin | act8/act10 | ACT |
| | PPL_06386 | Actin | act8/act10 | ACT |
| | PPL_02276 | Actin | | ACT |
| | PPL_11109 | Actin | | ACT |
| | PPL_02307 | Actin | | ACT |
| | PPL_06016 | Actin | | ACT |
| | PPL_07477 | Actin | | ACT |
| | PPL_00356 | Actin | | ACT |
| | PPL_04879 | Actin | | ACT |
| | PPL_01284 | Actin | | ACT |
| | PPL_01930 | Actin | | ACT |
| | PPL_07593 | Actin | | ACT |
| | PPL_09857 | Actin | | ACT |
| | PPL_09726 + PPL_09727 (ps) | | | ACT |
| | PPL_11012 + PPL_11013 (ps) | | | ACT |
| | PPL_11088 + PPL_11089 (ps) | | | ACT |
| DFA_02204 | PPL_04879 | Filactin | fia | ACT, Ig_FLMN |
| DFA_07689 | PPL_00836 | | DDB_G0281783 | ACT |
| Actin-related proteins | | | | |
| DFA_11064 | PPL_05006 | Arp1 | arpA | ACT |
| DFA_04728 | PPL_02640 | Arp2 | arpB | ACT |
| DFA_01271 | PPL_06567 | Arp3 | arpC | ACT |
| DFA_09828 | PPL_03353 | Arp4 | arpD | ACT |
| DFA_10876 | PPL_08620 | Arp5 | arpE | ACT |
| DFA_07539 | PPL_06793 | Arp6 | arpF | ACT |
| DFA_10279 | PPL_02817 | Arp8 | arpG | ACT |
| DFA_08075 | PPL_07579 | Arp11 | arpH | ACT |
| Nucleators and WH2 domain-containing proteins | | | | |
| DFA_07896 | PPL_04141 | Formin A | forA | FH2 |
| DFA_05548 | PPL_09405 | Formin B | forB | FH2 |
| DFA_08359 | PPL_01453 | Formin C | forC | FH2 |
| DFA_09767 | PPL_00374 | Formin D | forD | FH2 |
| DFA_09074 | PPL_03994 | Formin E | forE | FH2 |
| DFA_11177 | PPL_01049 | Formin F | forF | FH2 |
| DFA_11932 | PPL_06879 | Formin G | forG | FH2 |
| | PPL_03084 | Formin H | forH | FH2 |
| DFA_12049 | PPL_01192 | Formin I | forI | FH2 |
| DFA_03037 | PPL_09916 | Formin J | forJ | FH2 |
| DFA_12139 | PPL_02172 | RasGEF L | gefL | FH2 |
| DFA_06072 | PPL_11153 | Actobindin | abnA and abnB/abnC | WH2 |
| | PPL_06976 | Actobindin | abnA and abnB/abnC | WH2 |
| DFA_05359 | PPL_03078 | Scar/WASP-related | DDB_G0292878 | WH2 |
| DFA_09826 | | Scar/WASP-related | DDB_G0292878 | WH2 |
| DFA_10027 | PPL_03619 | WH2 and SH3 containing | DDB_G0269708 | WH2 |
| DFA_08469 | PPL_11259 | Scar | scrA | WH2 |
| DFA_03864 | PPL_12528 | WASP-1 | wasA | WH2 |
| DFA_02106 | PPL_01075 | WASP-2 | DDB_G0272811 | WH2 |
| DFA_01862 | PPL_02409 | WIPa | wipA | WH2 |
| DFA_03656 | PPL_10248 | WH2 and LIM containing | limD | WH2, LIM |
| DFA_04553 | PPL_06656 | WH2 and FYVE containing prot. kinase | slob1 | WH2 |
| DFA_08381 | PPL_07983 | WASP-related 2 | DDB_G0283827 | WH2 |
| DFA_11228 | PPL_01522 | VASP | vasP | WH2 |
| DFA_01179 | PPL_11231 | CARMIL | carmil | |
| DFA_12362 | PPL_08972 | PIR121 | pirA | |
| DFA_06509 | PPL_07173 | Nap1 | napA | |
| DFA_00342 | PPL_05945 | Abi2 | abiA | |
| DFA_04750 | PPL_06564 | HSPC300 | hspc300 | |
| DFA_10263 | PPL_09936 | ARPC1/p41-Arc | arcA | |
| DFA_01608 | PPL_12587 | ARPC2/p34-Arc | arcB | |
| DFA_12352 | PPL_12524 | ARPC3/p21-Arc | arcC | |
| DFA_02729 | PPL_08558 | ARPC4/p20-Arc | arcD | |
| DFA_10843 | PPL_05673 | ARPC5/p16-Arc | arcE | |
| Profilins | | | | |
| DFA_10759 | PPL_03008 | | proA | PRO |
| DFA_08065 | PPL_08338 | Profilin 2 | proB | PRO |
| | PPL_09347 | Profilin 1b | proA | PRO |
| | PPL_09348 | Profilin 1c | proA | PRO |
| | PPL_08342 | Profilin | none | PRO |
| | PPL_04133 | Profilin | none | PRO |
| Cofilin family | | | | |
| DFA_07056 | PPL_09568 | Cofilin 1 | cofA/cofB | ADF |
| | PPL_04258 | Cofilin | none | ADF |
| DFA_02155 | PPL_06671 | Twinfilin | twfA | ADF |
| DFA_11594 | PPL_04870 | Glia maturation factor | gmfA | ADF |
| DFA_09301 | PPL_10856 | Coactosin | coaA | ADF |
| DFA_00557 | PPL_02804 | Coactosin-related | DDB_G0277615 | ADF |
| DFA_05931 | PPL_10501 | Coactosin-related | DDB_G0237405_ps/DDB_G0273569_ps | ADF |
| DFA_11751 | PPL_10137 | Coactosin-related | DDB_G0283837 | ADF |
| DFA_04383 | PPL_11971 | Coactosin-related | DDB_G0290593 | ADF |
| DFA_01115 | PPL_01791 | Coactosin-related | DDB_G0270134 and DDB_G0270132 | ADF |
| DFA_03487 | PPL_00947 | Coactosin-related LIM-containing | DDB_G0295683 | ADF, LIM |
| DFA_07451 | PPL_01504 | Abp1 | abpE-1/abpE-2 | ADF |
| Gelsolin family | | | | |
| DFA_08590 | PPL_07900 | Villidin | vilA | WD, GEL, VHP |
| DFA_02121 | PPL_09589 + PPL_09588 | Villin-related | vilC | GEL, VHP |
| DFA_03296 | PPL_07974 | Flightless/villin-related | vilD | GEL, VHP |
| DFA_10993 | PPL_08041 | Villin-related | none | GEL, VHP |
| DFA_11169 | PPL_11301 | Severin | sevA | GEL |
| DFA_01370 | PPL_00388 | GRP125 | gnrA | GEL |
| DFA_04039 | PPL_01103 | Gelsolin-related | gnrB | GEL |
| DFA_02285 | PPL_06997 | Gelsolin-related | gnrC | GEL |
| DFA_08675 | | | none | GEL |
| I/LWEQ domain containing proteins | | | | |
| DFA_10890 | PPL_11331 | Sla2/HIP-1 | hipA | I/LWEQ |
| DFA_12116 + DFA_12551 | PPL_12363 | Talin A (Filopodin) | talA | I/LWEQ |
| DFA_09332 | PPL_08757 | Talin B | talB | I/LWEQ, VHP |
| Calponin homology domain family | | | | |
| DFA_08313 | PPL_04342 | Fimbrin-1 | fimA | 4xCHf |
| | PPL_07592 | Fimbrin-1b | fimA | 4xCHf |
| DFA_04405 | PPL_06198 | Enlazin/Fimbrin-2 | enlA/fimB | 4xCHf |
| DFA_09018 | PPL_05717 | Fimbrin-3 | fimC | 4xCHf |
| DFA_07856 | PPL_12620 | Fimbrin-4 | fimD | 4xCHf |
| DFA_06651 | PPL_07021 | Fimbrin-related protein | frpA | 2xCHf |
| DFA_07026 | PPL_06970 | Fimbrin-related RasGAP | DDB_G0275731 | 2xCHf |
| DFA_03151 | PPL_04640 | Interaptin | abpD | CH1-CH2 |
| DFA_01474 | PPL_01713 | Filamin | abpC | CH1-CH2, Ig_FLMN |
| DFA_04278 | PPL_04382 | α-actinin | abpA | CH1-CH2, SPEC |
| DFA_02967 | PPL_10148 | Cortexillin 1 | ctxA | CH1-CH2 |
| DFA_10888 | PPL_11332 | Cortexillin 2 | ctxB | CH1-CH2 |
| DFA_09244 | PPL_10792 + PPL_10791 | | DDB_G0294557 | CH1-CH2 |
| DFA_00535 | PPL_10742 | GxcCC | gxcCC | CH1-CH2 |
| DFA_03442 | PPL_09217 | | DDB_G0287875 | CH1-CH1 |
| DFA_06820 | PPL_10150 | NAV/unc-53-related | DDB_G0284113 | CH1 |
| DFA_00709 | PPL_05227 | GxcAA | gxcAA | CH1 |
| DFA_11842 | PPL_12167 | GxcDD | gxcDD | CH1 |
| DFA_10144 | PPL_05538 | | DDB_G0280565 | CH2 |
| DFA_11030 | PPL_04695 | | DDB_G0283441 | CH2 |
| DFA_08919 | PPL_00308 | Smoothelin-related | DDB_G0272472 | CH2 |
| DFA_01456 | PPL_04203 | | DDB_G0277777 | CH2 |
| DFA_01582 | PPL_02252 | | DDB_G0270220 | CH3 |
| DFA_03656 | PPL_01620 | CH-LIM | ChLim | CH3 |
| DFA_07682 | PPL_10611 | Trix, GxcB | gxcB | CH1-CH3-CH3 |
| DFA_07653 | PPL_07748 | GxcD | gxcD | CH3, VHP |
| DFA_05521 | PPL_09196 | GxcBB | gxcBB | CH3 |
| DFA_12166 | PPL_11514 | RacGEF1, GxcA | gxcA | CH3 |
| DFA_04829 | PPL_10606 | RasGEF P | gefP | CH3 |
| DFA_01526 | PPL_03632 | PAK D | pakD | CH3 |
| DFA_11087 | PPL_07783 | Gas2-related | DDB_G0289033 | CH3 |
| DFA_05725 | PPL_04872 | Transgelin-related | DDB_G0233800 | CH3 |
| DFA_05423 | PPL_06099 | | DDB_G0269824 | CH3 |
| DFA_06163 | PPL_09479 | | DDB_G0267974 | CH3 |
| DFA_05288 | PPL_07006 | | DDB_G0282859 | CH3 |
| DFA_01046 | | | none | CH3 |
| DFA_10665 | PPL_06943 | EB1 | eb1 | CHe |
| Myosins | | | | |
| DFA_03722 | PPL_01742 | Myosin II | mhcA | MYO |
| DFA_02481 | PPL_07637 | Myosin IA | myoA | MYO |
| DFA_11711 | PPL_08885 | Myosin IB | myoB | MYO |
| DFA_03309 | PPL_09000 | Myosin IC | myoC | MYO |
| DFA_10817 | PPL_05446 | Myosin ID | myoD | MYO |
| DFA_03539 | PPL_09899 | Myosin IE | myoE | MYO |
| DFA_08671 | PPL_00951 | Myosin IF | myoF | MYO |
| DFA_11323 + DFA_11322 | PPL_02557 | Myosin N | myoG | MYO |
| DFA_09266 | PPL_07496 | Myosin H | myoH | MYO |
| DFA_03162 | PPL_11051 | Myosin VII | myoI | MYO |
| DFA_06385 | PPL_11975 | Myosin J | myoJ | MYO |
| DFA_08491 + DFA_08490 | PPL_01622 | Myosin IK | myoK | MYO |
| DFA_10821 | PPL_05993 | Myosin M | myoM | MYO |
| Varia | | | | |
| DFA_04446 | PPL_02337 | Coronin | corA | WD |
| DFA_06579 | PPL_06027 | Coronin 7 | corB | WD |
| DFA_10596 | PPL_03685 | CAP | cap | |
| DFA_06985 | PPL_12353 | Cap32 | acpA | |
| DFA_11606 | PPL_11690 | Cap34 | acpB | |
| DFA_09731 | PPL_00946 | Aip | wdpA/DAip1 | WD |
| DFA_08405 | PPL_03004 | ABP34 | abpB | |
| DFA_07011 | PPL_11222 | Dynacortin | dct | |
| DFA_09048 | | Comitin-related | comA, cmr and DDB_G0295801 | |
| DFA_09514 | | Comitin-related | comA, cmr and DDB_G0295801 | |
| DFA_09428 | PPL_00174 | Comitin-related | comA, cmr and DDB_G0295801 | |
| DFA_01457 | | Comitin-related | comA, cmr and DDB_G0295801 | |
| DFA_00517 | | Comitin-related | comA, cmr and DDB_G0295801 | |
| | PPL_10748 | Comitin-related | comA, cmr and DDB_G0295801 | |
| | PPL_03184 | Comitin-related | comA, cmr and DDB_G0295801 | |
| | PPL_06482 | Comitin-related | comA, cmr and DDB_G0295801 | |
| | PPL_11248 | Comitin-related | comA, cmr and DDB_G0295801 | |
| | PPL_01920 | Comitin-related | comA, cmr and DDB_G0295801 | |
| DFA_05867 | PPL_03474 | Annexin VII (C1) | nxnA | |
| DFA_07722 | | Annexin VII (C1)-2 | nxnA | |
| | PPL_08292 | Annexin VII (C1)-related | nxnA | |
| DFA_00762 | PPL_02955 | Vinculin A | vinA | |
| DFA_06130 | PPL_03377 | Vinculin-related | vinB | |
| DFA_04473 | PPL_01687 | Elongation factor IA | efaAI | |
| DFA_01679 | PPL_07276 | Elongation factor IB | efa1B | |
| DFA_07410 | PPL_08792 | Elongation factor IB-2 | efa1B | |
| DFA_01762 | PPL_01438 | MHCKA | mhkA | |
| DFA_00533 | PPL_10754 | Kelch-related | DDB_G0279093 | |
| DFA_03062 | PPL_06860 | Kinesin 5 | kif5 | |
| DFA_04353 | PPL_05121 | | iliQ | VHP |
| DFA_03465 | PPL_00415 | | DDB_G0278025 | VHP |
| DFA_03056 | PPL_06917 | | DDB_G0288031 | VHP |
| DFA_03886 | PPL_05698 | | DDB_G0279449 | VHP |
| DFA_02227 | PPL_01076 | LimC | limC | LIM |
| DFA_12107 | PPL_11260 | LimD1 | limD1 | LIM |
| DFA_06751 | PPL_03431 | LimE | limE | LIM |

ACT, actin; ADF, actin depolymerizing factor/cofilin-like domain; CH, calponin homology domain (subtypes are indicated); FH2, formin homology domain 2; GEL, gelsolin repeat; Ig_FLMN, filamin type immunoglobulin domain; I/LWEQ, a particular F-actin-binding domain; LIM, zinc-binding domain present in Lin-11, Isl-1, Mec-3; MYO, myosin ATPase domain; PRO, profilin; (ps), pseudogene; SPEC, spectrin repeat; VHP, villin head piece; WD, WD40 repeat; WH2, Wiskott Aldrich syndrome homology region 2.


Figure S5.3. The actins and actin-related proteins of DF, PP, and DD. The phylogenetic tree was constructed using the neighbor joining method. DF, PP and DD actins are in red, green and black, respectively. Gene identification numbers are given for DF and PP. The scale bar represents amino acid substitutions per site. Note that in DD 17 genes encode proteins identical to actin 8. In PP and DF 6 and 4 genes, respectively, encode identical actins.


Figure S5.4. ADF domains in DF, PP, and DD. The phylogenetic tree was constructed using the neighbour joining method. DF, PP and DD proteins are in red, green and black, respectively. Gene identification numbers are given for DF and PP. For DD non annotated genes are given as gene identification numbers. Only the sequences of the ADF domains were used. Domains were aligned separately in those proteins consisting of duplications or triplications of the domain. The scale bar represents amino acid substitutions per site. Asterisks denote genes present in two copies in the DD genome as the result of the chromosome 2 duplication in the sequenced strain. DD cofA and cofB encode an identical protein but cofB is apparently not expressed. The two major groups of ADF domains are indicated by different colors of the tree branches and the domain architecture schemes.

5.4 Gene transcription

The modulation of gene transcription in terms of timing and intensity is the end point of several signal transduction chains. The previous analysis of the DD transcription factors (TFs) revealed that this organism has only a meager set of such proteins. Moreover, one class of TFs (bHLH) is entirely missing from the DD genome. The same is true for the DF and PP genomes, indicating an early loss of this class in this evolutionary lineage. Recently, a TF connected to STAT signaling, cudA, was identified (Yamada et al. 2008), which has no assigned Interpro domain. We found an ortholog of this factor in all species, confirming its universal importance for this clade. We screened the genomes for domains signifying TFs (Table S5.7). Since several different domains are connected to similar functions, the table shows not only the broadest descriptive domain terms (e.g. p53-like transcription factor) but also several more specific subclasses within the broader class (the secondary IPRs) or classes connected to others with the highest number of members. Intriguingly, most domains are represented in equal numbers in all genomes. This indicates that all species have the same transcriptional regulatory circuits. Yet, if we look at orthologous relationships between TFs of the different species, we see that often only a part of the TFs have clear orthologs in the other genomes. In some cases, such as the STAT TFs, all 4 members belong to the core set of orthologous proteins (100% in the last column of the table). These TFs are therefore well conserved and are likely to play key roles in transcriptional regulation. Other TFs have zero members in the core set. Thus, in these cases the domain is conserved but the protein as a whole is not. For example, IPR012890 is represented by one protein in each genome, but we could not detect an orthologous relationship between them. Since this particular domain is responsible for binding GC-rich sequences and the GC content of the different genomes varies widely, we conclude that the rest of the protein was adapted to cope with the altered nucleotide composition. From this analysis it also becomes clear that our failure to detect orthologous genes is very likely not due to the introduction of a large set of newly invented genes in a genome, but to proteins that have been highly altered by mutation.

Table S5.7: specific DNA binding transcription factor domains

| IPR / secondary IPR | Description | DD | DF | PP | core set proteins | % in core set |
|---|---|---|---|---|---|---|
| IPR011598 | Helix-loop-helix DNA-binding | 2 | 2 | 2 | 1 | 50 |
| IPR012890 | GC-rich sequence DNA-binding factor-like | 1 | 1 | 1 | 0 | 0 |
| IPR008967 | p53-like transcription factor, DNA-binding | 10 | 9 | 8 | 6 | 60 |
| IPR007888 | NDT80/PhoG like DNA-binding | 3 | 2 | 3 | 1 | 33 |
| IPR015988 | STAT_TF_coiled-coil | 4 | 4 | 4 | 4 | 100 |
| IPR015347 | STAT transcription factor homologue, coiled coil | 3 | 3 | 3 | 3 | 100 |
| IPR004181 | Zinc finger, MIZ-type | 5 | 2 | 1 | 0 | 0 |
| IPR000967 | Znf_NFX1 | 2 | 3 | 4 | 2 | 100 |
| IPR000679 | Zn-finger, GATA type | 23 | 23 | 24 | 3 | 13 |
| IPR009349 | Zinc finger, C2HC5-type | 1 | 1 | 1 | 1 | 100 |
| IPR009057 | Homeodomain-like | 52 | 46 | 47 | 18 | 34 |
| IPR012287 | Homeodomain-related | 28 | 21 | 23 | 10 | 35 |
| IPR001356 | Homeobox | 16 | 8 | 10 | 3 | 19 |
| IPR017970 | Homeobox_CS | 4 | 2 | 4 | 1 | 25 |
| IPR001005 | SANT domain | 32 | 35 | 37 | 17 | 53 |
| IPR014778 | Myb, DNA-binding | 28 | 30 | 29 | 15 | 54 |
| IPR017930 | Myb-type HTH DNA-binding domain | 18 | 16 | 18 | 10 | 56 |
| IPR017877 | MYB-like | 12 | 13 | 12 | 6 | 50 |
| IPR003347 | Transcription factor jumonji, jmjC | 13 | 10 | 8 | 8 | 62 |
| IPR013129 | Transcription factor jumonji | 5 | 4 | 3 | 3 | 60 |
| IPR003349 | Transcription factor jumonji, JmjN | 1 | 1 | 1 | 1 | 100 |
| IPR004827 | Basic-leucine zipper (bZIP) transcription factor | 17 | 16 | 16 | 10 | 59 |
| IPR011616 | bZIP transcription factor, bZIP_1 | 15 | 9 | 7 | 8 | 53 |
| IPR003958 | Transcription factor CBF/NF-Y/archaeal histone | 8 | 8 | 8 | 5 | 63 |
| IPR001289 | CCAAT-binding transcription factor, subunit B | 1 | 1 | 1 | 0 | 0 |
| IPR002100 | Transcription factor, MADS-box | 4 | 4 | 6 | 2 | 50 |
| IPR003657 | DNA-binding WRKY | 1 | 3 | 1 | 0 | 0 |
| IPR003441 | No apical meristem NAM protein | 1 | 1 | 2 | 0 | 0 |
| IPR001138 | Fungal transcriptional regulatory protein, N-terminal | 3 | 3 | 3 | 1 | 33 |
| IPR000232 | Heat shock factor | 1 | 1 | 1 | 1 | 100 |
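The core set columns of Table S5.7 can be read with the help of a small sketch. It rests on two assumptions made only for this illustration: that a core-set protein is one whose ortholog group has a member in each of the three genomes, and that the percentage is expressed relative to the DD count. The function names and the toy ortholog groups are hypothetical; the two worked values use the STAT (4 of 4) and helix-loop-helix (1 of 2) rows of the table.

```python
def core_groups(ortholog_groups, species=("DD", "DF", "PP")):
    """Keep ortholog groups that have a member in every listed species (assumed definition)."""
    return [group for group in ortholog_groups if all(s in group for s in species)]


def percent_in_core(core_count, total_count):
    """Core-set share in percent, relative to a chosen total (assumed here: the DD count).

    Note: Python's round() rounds exact halves to the nearest even integer.
    """
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    return round(100 * core_count / total_count)


# Hypothetical toy ortholog groups: species -> protein identifier.
toy_groups = [
    {"DD": "dd_tf1", "DF": "df_tf1", "PP": "pp_tf1"},
    {"DD": "dd_tf2", "PP": "pp_tf2"},
    {"DF": "df_tf3"},
]

if __name__ == "__main__":
    print(len(core_groups(toy_groups)), "core group(s) in the toy data")
    print("STAT_TF_coiled-coil:", percent_in_core(4, 4), "%")
    print("Helix-loop-helix:", percent_in_core(1, 2), "%")
```

A value of 0 in the last column then means that the domain is present in each genome but no ortholog group spans all three species.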

5.5 Cell adhesion

Adhesion is responsible for attachment to the substrate during movement, to particles during phagocytosis and for the formation of intercellular contacts. The requirements of these processes differ, so different molecules are involved. In the dictyostelids, species-specific molecules are required for cell-cell adhesion and kin recognition, such as contact site A (csA) and the tgr gene products lagB and lagC of DD, which have no obvious homologues in PP and DF, whereas the cadherin-like cadA, which mediates Ca2+-dependent adhesion at the onset of aggregation, is well conserved. Particle adhesion, which is essential for phagocytosis, involves both species-specific proteins and proteins belonging to conserved protein families. Cytoplasmic proteins linking the plasma membrane proteins to the actin cytoskeleton are similar to those in higher eukaryotes. In general, proteins from PP and DF are more similar to each other than to their DD counterparts.

Several gene products involved in adhesion to a substrate or to particles are highly conserved among eukaryotic cells. Among these are talin, vinculin and paxillin, which link the cytoskeleton to plasma membrane receptors. All three Dictyostelids harbour a Talin A and a Talin B gene, a paxillin gene and a vinculin gene. The proteins are highly conserved among the species, with Talin B being the most conserved. Talins have a characteristic N-terminal FERM domain (Band 4.1 domain), an elongated rod domain, an I/LWEQ motif that binds to F-actin and, in the case of Talin B, a villin head piece. Vinculin is distinguished by vinculin domains, and paxillin has several LIM domains. Talin can bind to the C-terminus of SibA (similar to integrin), which most likely represents the DD homolog of integrin. DD has four homologs of SibA, SibB to SibE, which are all capable of interacting with talin (Cornillon et al. 2006). They all have a VWA domain (von Willebrand factor (vWF) type A domain) and have been characterized as type 1 transmembrane proteins. PP and DF have six and four Sib-related proteins, respectively. No clear assignment to individual DD Sib proteins is possible, as we observed ~30% identity to each of them. SadA was described as an important phagocytosis receptor in DD. It is a nine transmembrane domain protein of 952 amino acids that is essential for substrate adhesion (Fey et al. 2002). SadA is unique to DD, whereas another group of nine transmembrane proteins, the Phg1 proteins Phg1A, 1B and 1C, is well conserved; the Phg1A proteins, for example, show around 60% identity with the DD protein. The Phg proteins are members of the EMP70 (endomembrane protein 70) superfamily, which has representatives in plants and animals. We noted again that for all these proteins the identity between the PP and DF representatives is in general higher than between the PP or DF proteins and their DD homologues.

DD has been a model organism for studies of cell-cell adhesion. Its multicellular life style relies on the ordered expression of several adhesion systems. Three such systems have been molecularly characterized. An early Ca2+-dependent adhesion system mediated by DDCad-1 is followed by the Ca2+-independent adhesion system for which csA/gp80 is responsible. In the post-aggregation stage lagC/gp150 is expressed and mediates Ca2+-independent contacts. DDCad-1 shows limited homology with classical cadherins. It is a unique adhesion molecule as it lacks a hydrophobic signal peptide and a transmembrane region (Wong et al. 1996). It is synthesized as a soluble protein and transported to the cell surface using a novel mechanism that includes the contractile vacuole (Sesaki et al. 1997). The structure of DDCad-1 has been solved. It contains two beta-sandwich domains, belonging to the beta/gamma-crystallin and immunoglobulin fold classes, so that the two-domain DDCad-1 exhibits a novel global fold (Lin et al. 2006). The domain is designated DUF1881 and was identified in a set of hypothetical bacterial and eukaryotic proteins as well as in various calcium-dependent cell adhesion molecules such as the Cad proteins. DD harbors a closely related protein, DDCad-3 (72% identity), which has not been characterized further, and the less conserved Cad2 with a DUF1881 domain (28.9% identity). Two Cad homologs are present in PP, whereas DF has at least five proteins harbouring a DUF1881 domain. csA/gp80, one of the first cell adhesion molecules to be identified, belongs to the immunoglobulin superfamily of cell adhesion molecules. It carries a signal peptide and is held in the plasma membrane by a phospholipid anchor (Stadler et al. 1989). It is unique to DD and has no homologs in PP and DF. This is also the case for lagC/gp150. The post-aggregation stage cell adhesion molecule lagC/gp150 shows some homology to csA and is also a member of the immunoglobulin superfamily, harbouring an IPT/TIG domain. It contains a signal peptide, a single transmembrane domain and a short cytoplasmic sequence. There are two more family members present in the DD genome. The proteins have a role in kin discrimination, and lagB1 and lagC1 are highly polymorphic in natural populations (Benabentos et al. 2009). Adherens junction-like structures have been detected during late stage development when the fruiting body is formed (Grimson et al. 2000). They involve α- and β-catenin-like molecules that have been identified by database searches. The cell adhesion molecule they associate with is not yet known. α- and β-catenin are each present in a single copy in all three species and are highly similar to each other across species.

Table S5.8: Cell adhesion proteins in dictyostelids

| Proposed function | Name | Functional domains | DD | PP | DF |
|---|---|---|---|---|---|
| Cell cell adhesion | cadA | DUF 1881 | DDB_G0285793 | | |
| | | DUF 1881, XTALbg | | | DFA_06924 |
| | Cad1 | DUF 1881, XTALbg, CLECT, EGF, TM | | | DFA_07420 |
| | Cad2 | DUF1881 | DDB_G0275439 | | DFA_09595 |
| | Cad3 | DUF1881 | DDB_G0285817 | PPL_07647 | DFA_04152, DFA_06924, DFA_09536 |
| Cell cell adhesion, signaling | AarA; β-catenin | F-box, armadillo repeats | DDB_G0288877 | PPL_10261 | DFA_10269 |
| Cell substrate adhesion, motility | Vinculin-related, α-catenin | Vinculin domains, SMC | DDB_G0281499 | | |
| | | Vinculin domains | | PPL_03377 | DFA_10269 |
| | Paxillin | Lim domains | DDB_G0274109 | PPL_02599 | DFA_00624 |
| | Vinculin, vinA | Vinculin domains | DDB_G0285939 | PPL_02955 | DFA_00762 |
| Cell substrate adhesion | Talin TalA | Band 4.1 domain, ILWEQ motif | DDB_G0290481 | PPL_12363 | DFA_12115, DFA_12116 |
| Morphogenesis | TalB | Band 4.1 domain, ILWEQ motif, villin headpiece | DDB_G0287505 | PPL_08757 | DFA_09332 |
| Cell substrate adhesion, phagocytosis | Phg1A | 9 (10) TM domains, EMP70 | DDB_G0267444 | PPL_10513 | DFA_06898 |
| | Phg1B | EMP70, TM domains | DDB_G0277273 | PPL_10355 | DFA_09645 |
| | Phg1C | TM domains, EMP70 | DDB_G0290159 | PPL_05378 | DFA_05891 |
| Cell substrate adhesion, phagocytosis | sibA | VWA, TM | DDB_G0287363 | | |
| | sibB | TM, EGF-like, VWA | DDB_G0288103 | | |
| | sibC | EGF, VWA | DDB_G0288195 | | |
| | sibD | | DDB_G0288197 | | |
| | sibE | TM, EGF-like, VWA | DDB_G0288239 | | |
| | sib | EGF, VWA, TM | | PPL_11489 | DFA_08669 |
| | | PKD, EGF, VWA, TM | | PPL_11490 | |
| | | EGF-like, EGF, VWA | | PPL_00411 | DFA_08670 |
| | | EGF, EGF-like, TM | | PPL_11493 | DFA_08089 |
| | | | | PPL_11395 | |
| | | | | PPL_11435 | DFA_08088 |

DUF1881: Protein of unknown function. This domain is found in a set of hypothetical bacterial and eukaryotic proteins, as well as in various calcium-dependent cell adhesion molecules. XTALbg: Beta/gamma crystallins. CLECT: calcium-dependent carbohydrate binding module. EMP70: endomembrane protein 70 superfamily. F-Box: conserved structural motif which is present in numerous proteins and serves as a link between a target protein and a ubiquitin-conjugating enzyme. I/LWEQ motif: F-actin binding function. SMC: structural maintenance of chromosomes. VWA: von Willebrand factor (vWF) type A domain. VWA domains in extracellular eukaryotic proteins mediate adhesion. EGF: Epidermal growth factor. The EGF motif is found frequently in nature, particularly in extracellular proteins. TM: transmembrane domain.

5.6 Cell signaling

5.6.1 Seven transmembrane domain receptors including G-protein coupled receptors

G-protein coupled receptors (GPCRs) or 7 transmembrane domain receptors (7TMDRs) are found in all eukaryotes and constitute 3-5% of all genes in vertebrates. These receptors detect a variety of extracellular signals and transduce them, generally via heterotrimeric G-proteins, to effector proteins inside the cell and thus elicit a physiological response. They are characterized by an N-terminal extracellular domain, a C-terminal intracellular domain and a core domain containing seven transmembrane regions. The 7TMDRs are subdivided into six major families that, aside from their conserved secondary domain structure, do not share significant sequence similarity across families. Family 1 includes the β-adrenergic, light and odorant receptors; family 2 receptors are activated by large peptides like glucagon or secretin; family 3 comprises the metabotropic glutamate receptors, Ca2+-sensing receptors, a group of putative pheromone receptors coupled to G0 and the GABA(B) receptors; family 4 contains pheromone receptors associated with Gi; family 5 includes the frizzled and smoothened receptors involved in embryonic development; and family 6 comprises the DD cAMP receptors (Bockaert and Pin 1999). The original analysis of the DD genome revealed the surprisingly high number of 55 such receptors, approximately 0.5% of the encoded genes (Eichinger et al. 2005). Besides the four well-studied cAMP receptors (cAR), the genome encodes eight additional cAMP receptor-like (crl) proteins, one of which, rpkA, is distinguished by a novel domain structure. Furthermore, there are one secretin- or family 2-like receptor, 17 GABA(B)-like receptors (family 3) and 25 frizzled-like receptors (family 5), of which 9 do not possess the N-terminal cysteine-rich domain (CRD) that is considered to be an essential feature of members of the frizzled/smoothened family. These were named frizzled/smoothened-like sans CRD (fsc) receptors.
Surprisingly, analysis of the PP and DF genomes led to the detection of two additional groups of GPCRs which were not noticed before and are distantly related to GPR89 and transmembrane protein 145 from man. Analysis of the DF (group 1) and PP (group 2) genomes revealed the presence of the same major families and orphan members of GPCRs as in DD; however, the total number was significantly reduced, mainly due to differences in family 3 and 5 receptors. The total number of DD GPCRs was found to be 61, while PP and DF encode 42 and 38 such receptors, respectively. Thus, there seems to be an evolutionary trend in the Dictyostelids towards a higher number of GPCRs in group 4. In addition, there are also differences between the three Dictyostelids in each of the four major GPCR families: the cAR/Crl, the secretin, the GABA(B) and the frizzled/smoothened family. In the cAR/Crl family the DD genome encodes four cAMP receptors, while there appear to be only three in DF and two in PP. Based on the evolutionary tree, the LCA apparently had only one gene encoding a cAMP receptor. This gene duplicated once in PP and related taxa (Kawabe et al. 2009), twice in DF and three times in DD (Fig. S5.5A). In contrast, we find 1:1:1 orthologous relationships for all Crl receptors except CrlF, which was either lost in DF or appeared after the split of group 1 organisms, indicating that the LCA already encoded almost all Crl receptors that were found in DD. For some Crls we see species-specific expansion in DF and/or PP (Fig. S5.5). For family 2 it is noteworthy that PP possesses three latrophilin receptor-like (lrl) paralogues while DD and DF possess only one each (Fig. S5.5B). The common ancestor likely encoded two such receptors, of which one was lost in DD and DF. Interestingly, DD encodes a truncated receptor which is probably a pseudogene and is most closely related to lrlA (Prabhu and Eichinger 2006).
The biggest differences between the three Dictyostelids are apparent in the GABA(B) (family 3) and frizzled/smoothened (family 5) receptors. In comparison to DD there is an 18% reduction in family 3 receptors in PP and a nearly 60% decrease in DF. With the exception of GrlE, which binds GABA and induces terminal differentiation of DD (Anjard and Loomis 2006), GrlL and GrlP, for which we see 1:1:1 orthologous relationships, we find mainly species-specific expansions of different subgroups in the three organisms (Fig. S5.5C). For family 5 receptors the differences in numbers are even more dramatic. Neither PP nor DF appears to encode frizzled/smoothened-like receptors sans CRD (fsc, subfamily 5), of which 9 are encoded in the DD genome (Fig. S5.5D). For many of the receptors we see DD-specific duplications and we do not find clear 1:1:1 orthologous relationships for any of the DD encoded receptors. However, there are many 1:1 orthologues between DF and PP, indicating that expansion of this family occurred in group 4 organisms through multiple duplications. This conclusion is supported by the strong increase of family 5 receptors in DD, by 78% in comparison to DF and by 312% as compared to PP.

A surprising result of the analysis was the discovery of additional putative GPCRs in the DD genome (Fig. S5.5E). One receptor (DDB_G0283855), of which there is one gene in each of the three Dictyostelids, is highly conserved across eukaryotes and a member of the orphan vertebrate GPR89 group. In Arabidopsis thaliana there are two such receptors, GTG1 and GTG2 (GPCR-type G protein 1 and 2), and recently it was reported that they constitute abscisic acid receptors (Pandey et al. 2009). It should be noted that the DD receptor is only 32% identical to GTG1 and, with around 40% identity, is most closely related to the corresponding vertebrate receptors, the function of which is unknown. The second group of previously unnoticed putative GPCRs is related to the human transmembrane protein 145 (accession: NP_775904), whose function is also unknown. These receptors are only moderately conserved, e.g. DDB_G0291340 has 23% sequence identity to NP_775904. Again these receptors expanded in DD: we could find five in DD, two in PP and three in DF (Fig. S5.5).

In summary, all of the GPCR families in the three Dictyostelids were already present in the common ancestor approximately 600 million years ago. The repertoire of 7TMDRs in the Dictyostelids is very diverse and we noted many cases where an ancestral gene has either been expanded or lost independently in one of the three species, highlighting the need to adapt to an ever-changing environment. The data point to an evolutionary trend in the Dictyostelids towards a higher number and larger diversity of GPCRs in group 4 organisms.
Interestingly, only the GPCRs are diverse; the subsequent links in the signal transduction chain are nearly completely conserved in all species (Fig. S5.6, S5.7, S5.8; Table S5.9). Thus, signals are received by diverse sets of receptors but transduced by a common set of heterotrimeric G proteins and downstream components. This indicates that multiple diverged receptors activate the same signal transduction pathways.
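The receptor counts quoted in this section can be tabulated to make the comparison between species explicit. The following is a minimal sketch and not part of the original analysis: the variable and function names are illustrative, and the only numbers used are those stated in the text (the family breakdown of the 55 receptors from the original DD analysis, and the totals of 61, 42 and 38 receptors for DD, PP and DF).

```python
# Family breakdown of the 55 receptors reported in the original DD analysis
# (Eichinger et al. 2005), as quoted in the text.
dd_original = {
    "cAMP receptors (cAR)": 4,
    "cAMP receptor-like (crl)": 8,
    "secretin-like (family 2)": 1,
    "GABA(B)-like (family 3)": 17,
    "frizzled-like (family 5)": 25,
}
original_total = sum(dd_original.values())

# Total GPCR numbers per species from the present analysis, as quoted in the text.
totals = {"DD": 61, "PP": 42, "DF": 38}


def percent_fewer(reference: int, other: int) -> float:
    """Percentage by which `other` is smaller than `reference`."""
    return 100.0 * (reference - other) / reference


# Reduction of the total repertoire in PP and DF relative to DD.
reduction_vs_dd = {
    species: round(percent_fewer(totals["DD"], count), 1)
    for species, count in totals.items()
    if species != "DD"
}

print(original_total)
print(reduction_vs_dd)
```

Relative to DD, the totals given in the text correspond to roughly one third fewer receptors in PP and somewhat more than one third fewer in DF.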


Figure S5.5: The repertoire of DD, PP and DF 7TMDRs including GPCRs. Evolutionary trees for the cAR/Crl (A, family 6), the secretin-like (B, family 2), the GABA(B) receptor-like (C, family 3), the frizzled/smoothened-like families (D, family 5) and the orphan receptors related to human transmembrane protein 145 and GPR89 (E) are shown. CLUSTALX alignments of the sequences encompassing the seven transmembrane domains of DD, PP and DF GPCRs were used to create dendrograms with the TreeView program. Bootstrap values are provided at the node of each branch. The scale bar indicates amino acid substitutions per site. Locus tags are provided for DF and PP and dictyBase identifiers for DD receptors. *Less than seven transmembrane domains were used due to errors in the gene prediction. **Two gene models code for one receptor and need to be fused correctly. ***One gene model codes for two or more genes and needs to be split correctly.

Table S5.9: Heterotrimeric G protein domains found in dictyostelids

G protein subunits | DD | PP | DF
IPR015898 G-protein, gamma-like subunit | 1 | 1 | 1
IPR001632 G protein beta subunit | 2 | 2 | 2
IPR001019 G protein alpha subunit | 14 | 14 | 13

5.6.2 Cyclic nucleotide synthesis, detection and degradation

The ubiquitous role of cAMP in coordination of cell movement, morphogenesis and cell differentiation is one of the most striking aspects of DD development. As an intracellular messenger for other stimuli, cAMP triggers the growth to development transition and the maturation of stalk and spore cells, and it also keeps spores dormant under conditions unfavourable for growth. Secreted cAMP pulses coordinate cell movement during aggregation and fruiting body formation and upregulate expression of aggregation genes. Secreted cAMP also induces expression of prespore genes and inhibits expression of stalk genes (Aubry and Firtel 1999; Saran et al. 2002). The related molecule cGMP has a more restricted role as an intermediate in one of four pathways that mediate the chemotactic response (Veltman et al. 2008). Recent studies of functional conservation of adenylate cyclase G (ACG) and cAMP receptors (cARs) across the Dictyostelid phylogeny provided clues as to how the divergent roles of cAMP may have evolved. The role of intracellular cAMP in induction of spore formation and maintenance of spore dormancy is derived from a role as signalling intermediate for stress-induced encystation and stress-inhibited excystation in solitary amoebas (Ritchie et al. 2008). The chemoattractant function of cAMP during aggregation of group 4 taxa appeared to be derived from a similar role in coordinating fruiting body morphogenesis in all Dictyostelia (Alvarez-Curto et al. 2005; Kawabe et al. 2009). The deepest ancestral role of secreted cAMP was probably to direct aggregated cells to form spores and not cysts (Kawabe et al. 2009). Much deeper insight into the mechanisms by which elaboration of cAMP signalling generated phenotypic complexity during Dictyostelid evolution is yet to be gained. The completed DF and PP genomes allow us a first insight into the core features and evolutionary changes in the cyclic nucleotide signalling genes.
DD uses the adenylate cyclases ACA, ACB and ACG and the guanylate cyclases sGC and GCA for synthesis of cAMP and cGMP, respectively (Saran et al. 2002). All five cyclases are conserved in PP and DF (Figure 4, main text). Transmembrane (TM) helices and functional domains are also conserved between taxa, with the exception of the gain of a single C-terminal TM helix by PP ACB. PP additionally carries two gene duplications of ACA; the ACA1 flanking genes show synteny with DD ACA, which identifies ACA1 as the DD ACA ortholog. The transient activation of ACA by a complex signalling pathway generates the waves of cAMP that shape the aggregates and fruiting structures. Three essential components of this pathway, CRAC, PIA and Rip3 (Lee et al. 2005), are also present in PP and DF, suggesting that the dynamic regulation of ACA is conserved throughout the Dictyostelia.

The G-protein regulated cARs and intracellular cyclic nucleotide binding (cNMP) domains are the sole known targets for cyclic nucleotides in DD. There are 4 cARs in DD, with cAR1 being the ancestral gene (Louis et al. 1994; Alvarez-Curto et al. 2005). This gene duplicated once in PP and related taxa (Kawabe et al. 2009) and twice in DF (see also Fig. S5.5). cNMP domains are present in the regulatory subunit of PKA (PkaR), the cGMP binding proteins GbpC and GbpD and the phosphodiesterases PdeD and PdeE. PdeD is a cGMP phosphodiesterase that is stimulated by cGMP binding to its cNMP domains, while PdeE is a cAMP-stimulated cAMP phosphodiesterase (Meima et al. 2002; Meima et al. 2003). GbpC is a multidomain protein in which cGMP binding to its cNMP domains sequentially activates the intrinsic RasGEF, Ras/Roc and protein kinase domains, eventually causing cell polarisation. GbpD also contains a RasGEF domain, but no kinase domain. Its cNMP binding domains are not functional and it functions as an antagonist of GbpC in cell polarisation (Bosgraaf et al. 2005; van Egmond et al. 2008). The PkaR cNMP domains are most similar to those of metazoan PkaRs (Figure S5.6), while those of PdeD and PdeE most closely resemble ciliate and prokaryote domains that are similarly linked to the PDE_III domain. The cNMP domains of GbpC and GbpD are most similar to those of prokaryote cAMP-regulated transcription factors. This indicates that the association of the cNMP domains with their effector domains predates the origin of the Dictyostelia. All 5 cNMP binding proteins with their complete sets of functional domains are present in PP and DF.

Degradation of cyclic nucleotides in DD is catalyzed by three classes of phosphodiesterases (PDEs) (Bader et al. 2007). The cAMP PDEs RegA and Pde4 and the cGMP-PDE Pde3 use the PDE_I domain with HDc motif. PdsA and Pde7, which hydrolyze both cAMP and cGMP, contain the PDE_II type domain with HSHLDH motif.
PdeD and PdeE, which hydrolyze cGMP and cAMP respectively, carry a similar HCHADHDS motif, but are structurally more similar to β-lactamase_II. PdeD and PdeE conservation in PP and DF is presented in the cNMP binding protein phylogeny (Figure S5.6). Pde3, but not Pde4, is conserved in PP and DF, while a second copy of RegA is present in both PP and DF. The PDE_II enzyme PdsA of DD, but not its homolog Pde7, is conserved in PP and DF, with a second copy in DF. All Dictyostelium enzymes contain the signal peptide that targets DD PdsA to the cell surface. PDI, a secreted inhibitor of DD PdsA (Franke et al. 1991), is not present in PP and DF. In conclusion, the genes associated with cyclic nucleotide signalling are extremely well conserved both in number and in domain architecture. This suggests that the role of cGMP and most of the multiple roles of cAMP in DD are also present in DF and PP. This does not mean that these roles are invariant; novel roles may have appeared through changes in gene regulation. This is the case for cAR1, where addition of a distal promoter to the existing post-aggregative promoter of the gene caused cAR1 to be expressed during aggregation and enabled the chemoattractant function of cAMP during aggregation of DD and other group 4 species (Alvarez-Curto et al. 2005; Kawabe et al. 2009).


Figure S5.6. Cyclic nucleotide binding proteins and phosphodiesterases. Genes were retrieved from the PP and DF genomes by BLAST query with the DD homologs. The deduced protein sequences were analyzed by SMART for the presence of functional domains. Outgroup sequences for PDE and cNMP binding domains were obtained by Psi-BLAST query of non-redundant proteins in Genbank with subclasses of DD PDE and cNMP functional domains. To build protein phylogenies, conserved shared functional domains were aligned using CLUSTAL-W (Chenna et al. 2003) and phylogenetic relationships between aligned sequences were determined by Bayesian inference (Ronquist and Huelsenbeck 2003). Rate variation between sites was estimated by a gamma distribution with a proportion of invariable sites. Analyses were run for 100,000 generations or until the standard deviation of split frequencies was < 0.01. The phylogenetic trees are presented unrooted and decorated with the domain architectures of the proteins. The posterior probabilities (BIPP) of nodes are indicated by colored dots. Locus and gene ids: GbpC, DD: DDB_G0291079, PP: PPL_12173, DF: DFA_03461; GbpD, DD: DDB_G0282373, PP: PPL_00535, DF: DFA_00705; PdeD, DD: DDB_G0274383, PP: PPL_00860, DF: DFA_01848; PdeE, DD: DDB_G0276027, PP: PPL_03231, DF: DFA_00711; PkaR, DD: DDB_G0279413, PP: PPL_11964, DF: DFA_06371; Pde3, DD: DDB_G0268634, PP: PPL_04383, DF: DFA_04243; Pde4, DD: DDB_G0289121; RegA, DD: DDB_G0284331, PP: PPL_02941, DF: DFA_06070; RegB, PP: PPL_09451, DF: DFA_08236; PdsA, DD: DDB_G0285995, PP: PPL_10234, DF: DFA_07709; Pde7, DD: DDB_G0289145; PdsD, DF: DFA_01410.
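The stopping rule mentioned in the legend (a standard deviation of split frequencies below 0.01) can be illustrated with a small calculation. The sketch below is only an illustration under stated assumptions and is not the code of the inference software used here: the function name, the example split labels and frequencies, the use of the sample standard deviation, and the minimum-frequency cutoff of 0.1 are all assumptions made for this example.

```python
from statistics import mean, stdev


def average_sd_of_split_frequencies(runs, min_freq=0.1):
    """Average, over splits, of the standard deviation of a split's frequency across runs.

    `runs` is a list of dicts (one per independent run) mapping a split label
    to the fraction of sampled trees that contain this split.
    Splits below `min_freq` in every run are ignored (assumed cutoff).
    """
    all_splits = set().union(*runs)
    deviations = []
    for split in all_splits:
        freqs = [run.get(split, 0.0) for run in runs]
        if max(freqs) < min_freq:
            continue  # rare split in all runs, skipped
        deviations.append(stdev(freqs))  # sample standard deviation (assumption)
    return mean(deviations) if deviations else 0.0


# Hypothetical split frequencies from two independent runs.
run_a = {"AB|CD": 0.98, "AC|BD": 0.02}
run_b = {"AB|CD": 0.97, "AC|BD": 0.03}

asdsf = average_sd_of_split_frequencies([run_a, run_b])
converged = asdsf < 0.01  # threshold quoted in the legend
print(round(asdsf, 4), converged)
```

A small value indicates that independent runs sample the same splits at similar frequencies, which is the usual justification for stopping the analysis.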

5.6.3 Sensor histidine kinases

Sensor histidine kinases are widely distributed receptor proteins. They are represented by at least 11 subfamilies in prokaryotes and archaea, of which only 1 subfamily has reached the eukaryote domain (Wolanin et al. 2002). This subfamily has representatives in plants, fungi, alveolates and social amoebas, but not in animals. Sensor histidine kinases consist of a receptor domain that regulates the associated histidine kinase/phosphatase (HATPase C) activity, to phosphorylate or dephosphorylate a conserved histidine on the attached phosphorylation/dimerization domain (HisKA). The phosphoryl group is relayed to a conserved aspartate on a response regulator/receiver domain that can either control an effector directly or relay the phosphoryl group to a histidine phosphotransfer (Hpt) protein, which then relays it further to a receiver that activates an effector. The DD genome contains 15 so-called hybrid sensor histidine kinases, where the first receiver is part of the histidine kinase protein. There is one phosphotransfer protein, RdeA, and five proteins with a receiver domain. Only one of these proteins, the cAMP phosphodiesterase RegA, has as yet been established to be regulated by histidine kinases (Thomason et al. 1998). RegA crucially regulates intracellular cAMP levels and thereby the activation state of PKA, which controls many processes in the course of DD development, such as the growth to development transition, the switch from slug migration to culmination and the maturation of stalk and spore cells. The roles of several DD histidine kinases have been uncovered. DhkA acts as a histidine phosphatase when stimulated by SDF-2, a peptide that triggers spore encapsulation (Wang et al. 1999). This causes reverse phosphorelay from RegA, thereby inactivating cAMP hydrolysis and activating PKA.
DhkB detects discadenine, a plant-like cytokinin which also triggers spore maturation and inhibits spore germination (Anjard and Loomis 2008; Zinda and Singleton 1998). The histidine kinase activity of DhkC, and consequently RegA, is activated by the catabolite ammonia, an inhibitor of stalk cell maturation, which is lost by diffusion once slugs project upwards to initiate fruiting body formation (Singleton et al. 1998). DhkK is also essential for fruiting body formation, but its ligand is not known (Thomason et al. 2006). The histidine phosphatase activity of DokA is activated by high osmolality, which results in RegA inactivation and PKA activation. PKA then protects cells from osmotic stress by an as yet unresolved mechanism (Ott et al. 2000; Schuster et al. 1996). Apart from DhkA and DhkI, which are not present in DF, and DhkH, which is absent from PP, all other DD histidine kinases are conserved in the DF and PP genomes (Figure S5.7). DF and PP additionally have a novel protein, DhkN, which is absent from DD. The domain architecture of the proteins, which can be quite elaborate due to the presence of one or more transmembrane domains, sensors such as CHASE and GAF domains and, in the case of DhkG, an additional serine/threonine kinase domain, is well conserved between orthologs. RegA is duplicated in PP and DF and the other five DD proteins with response regulator domains are also conserved in DF and PP, indicating that their roles, though as yet unknown, may be important. Apart from the receiver domain, RegF has a highly conserved 97 AA region (51% identity) at the N-terminus (NRegF), which could be its effector. The RdeA gene is also conserved in PP and DF.


Figure S5.7: Conservation and change in sensor histidine kinases and response regulators. PP and DF genes were retrieved by BLAST search of the assembled genomes. Deduced protein sequences were analyzed by SMART for the presence of functional domains, signal peptides and transmembrane helices. To build protein phylogenies, conserved shared functional domains (combined HATPaseC and HisK domain for histidine kinases and the receiver domain for response regulators) were aligned using CLUSTAL-W to juxtapose functionally essential amino-acid residues. Regions that did not align unambiguously were deleted. Phylogenetic relationships between aligned sequences were determined by Bayesian inference using a mixed amino-acid model. The functional domain architectures for the histidine kinases (A) and response regulators (B) are mapped onto the phylogeny and are also shown for the single conserved histidine phosphotransfer protein RdeA (C) in DD, PP and DF. Locus and gene ids: DhkA, DD: DDB_G0280961, PP: PPL_06424; DhkB, DD: DDB_G0277845, PP: PPL_05707, DF: DFA_10198; DhkC, DD: DDB_G0274191, PP: PPL_03764, DF: DFA_00637; DhkD, DD: DDB_G0282289, PP: PPL_06547, DF: DFA_01554; DhkE, DD: DDB_G0269204, PP: PPL_09010, DF: DFA_08401; DhkF, DD: DDB_G0276143, PP: PPL_06988, DF: DFA_01123; DhkG, DD: DDB_G0284045, PP: PPL_10050, DF: DFA_00985; DhkH, DD: DDB_G0279913, DF: DFA_00944; DhkI, DD: DDB_G0273475, PP: PPL_03537; DhkJ, DD: DDB_G0277883, PP: PPL_00413, DF: DFA_01694; DhkK, DD: DDB_G0277887, PP: PPL_07312, DF: DFA_10049; DhkL, DD: DDB_G0282927, PP: PPL_02718, DF: DFA_05484; DhkM, DD: DDB_G0282377, PP: PPL_09536, DF: DFA_04262; DokA, DD: DDB_G0282377, PP: PPL_04384, DF: DFA_04275; RdeA, DD: DDB_G0282923, PP: PPL_09709, DF: DFA_02148. RegA and RegB, see legend of Figure S5.6; RegC, DD: DDB_G0291424, PP: PPL_03816, DF: DFA_07787; RegD, DD: DDB_G0287725, PP: PPL_03735a, DF: DFA_11130; RegE, DD: DDB0233892, PP: PPL_10608, DF: DFA_06885; RegF, DD: DDB0190773, PP: PPL_02457, DF: DFA_07389. NRegF: well conserved region N-terminal of RegF receiver.
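The presence and absence of histidine kinases across the three species can be read directly from the locus ids in the legend above. The following sketch shows one way to do this; it is an illustration only, the dictionary and function names are hypothetical, and the table records merely which species have a locus id listed in the legend (DhkN is added from the main text of this section, since the legend gives no ids for it).

```python
SPECIES = ("DD", "PP", "DF")

# Species for which the legend of Figure S5.7 lists a locus id.
listed_in = {
    "DhkA": ("DD", "PP"),
    "DhkB": ("DD", "PP", "DF"),
    "DhkC": ("DD", "PP", "DF"),
    "DhkD": ("DD", "PP", "DF"),
    "DhkE": ("DD", "PP", "DF"),
    "DhkF": ("DD", "PP", "DF"),
    "DhkG": ("DD", "PP", "DF"),
    "DhkH": ("DD", "DF"),
    "DhkI": ("DD", "PP"),
    "DhkJ": ("DD", "PP", "DF"),
    "DhkK": ("DD", "PP", "DF"),
    "DhkL": ("DD", "PP", "DF"),
    "DhkM": ("DD", "PP", "DF"),
    "DokA": ("DD", "PP", "DF"),
    "DhkN": ("PP", "DF"),  # from the text, not the legend
}


def absent_species(table):
    """Map each gene that is not found in all species to the species lacking it."""
    return {
        gene: [sp for sp in SPECIES if sp not in present]
        for gene, present in table.items()
        if len(present) < len(SPECIES)
    }


absent = absent_species(listed_in)
conserved = [gene for gene in listed_in if gene not in absent]
print(absent)
print(len(conserved))
```

The result agrees with the text: DhkA and DhkI have no DF entry, DhkH has no PP entry, and DhkN has no DD entry.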

5.6.4 Monomeric G-proteins

Components involved in signaling to and from small GTPases of the Rho family are particularly important for cytoskeletal remodeling during chemotaxis and phagocytosis. In general terms the repertoire of genes encoding proteins that reportedly or presumably participate in Rho signaling is very similar in all three amoeba species and, with the exceptions that are discussed below, every DD gene has an ortholog in both DF and PP (Table S5.10). Here we will discuss the differences in the inventory of Rho signaling components among the three species of amoeba. The reader is referred to Vlahou and Rivero (2006) for details on the domain architecture and functions of the proteins. One family that has diversified independently in the three species is the family of Rho GTPases (Fig. S5.9). This family currently comprises 20 rac genes and one pseudogene in DD, 14 genes in DF and 21 genes in PP. Several DD rac genes have a direct ortholog in both DF and PP, namely rac1a, racA to racE, racH, racL and racP, indicating that this core of shared genes may play conserved roles in all social amoebae. It also becomes apparent that the ancestral rac1 gene duplicated independently in each organism, resulting in four more genes in DD (rac1b, rac1c, racF1, racF2), two in PP and four in DF. Of the four in DF, three (DFA_10340a, DFA_10340b and DFA_10342) are located in tandem in the genome and have the same gene structure, suggesting that they are the result of gene duplications. RacG is encoded by an intronless gene that might have resulted from the retrotransposition of an ancestral racL transcript. RacG is absent in DF, where it could have been secondarily lost. There are no direct orthologs for any of a group of divergent DD rac genes that includes racI, racJ and racM to racO, but DF and PP have, respectively, 1 and 8 other genes without clear affiliation, again indicating that the rac family has undergone independent rounds of duplications in all three species.
Among the regulators, DF and PP only have one gene encoding a typical RhoGDI. The rdiB gene encoding a truncated RhoGDI in DD might have resulted from a gene duplication that took place after the radiation of the Dictyostelids. Two DD RhoGAPs, gacB and gacII, are absent in both DF and PP. The protein encoded by gacII consists of a RhoGAP domain followed by an SH3 domain and is very similar to the N-terminal half of RacGAP1 (xacA gene); it is therefore likely that gacII resulted from a partial duplication of xacA in DD. Both DF and PP have one RhoGAP whose DD ortholog lacks the RhoGAP domain, and DF has one additional RhoGAP apparently unique to this species.

There are also some differences in the inventory of Rho GEFs. Four DD RhoGEFs, gxcH, gxcZ, DDB_G0293338 and gxcKK, are absent in both DF and PP. Because gxcH, gxcZ and DDB_G0293338 are very similar to, respectively, gxcI, gxcZ and gxcY, these might be, as in the case of rdiB, the result of duplications that took place after the radiation of the Dictyostelids. PP has one and DF two additional Rho GEFs that appear unique to the respective species. The two new DF Rho GEFs are related to each other. The inventory of GEFs of the Dock180 family is identical in all three species, but DF has two additional genes encoding Elmo proteins. Among the effectors, both DF and PP have two genes encoding WASP. One of them is the ortholog of DD DDB_G0272811, which encodes a protein lacking a WH1 domain. Because the DF and the PP proteins conserve the WH1 domain, it is possible that the DD gene underwent a truncation. DF apparently lacks one formin gene (forH), and one of the two Ras GEFs with a formin-related Rho-binding domain, gefV, is missing in both DF and PP. GefV is probably a duplication of gefL characteristic of DD. Finally, the class of PI4P 5-kinases has undergone expansion in PP, where a second divergent copy of the rpkA gene exists, and in DF two genes placed in tandem encode phospholipase C proteins that are 80% similar to each other.

Table S5.10. Genes involved in Rho signaling in DF and PP in comparison to DD

Protein class | DF | PP | DD | Relevant domain or component¹ | Closest relatives in other organisms

Rho GTPases
Rac-like | 6 | 4 | 6 | GTPase | Rac
RhoBTB-like | 1 | 1 | 1 | GTPase | RhoBTB
Other RhoGTPases² | 7 | 15 | 14 | GTPase | Rac (Unique)

Dissociation inhibitors
RhoGDI | 1 | 1 | 2 | RhoGDI | RhoGDI

Exchange factors
RhoGEF³ | 45 | 44 | 47 | RhoGEF (DH) | Mostly unique
CZH + Elmo | 8 + 8 | 8 + 6 | 8 + 6 | CZH2 (DHR-2) | DOCK180, MBC, CED-5
Darlin | 1 | 1 | 1 | | SmgGDS, Yeb3p

GTPase activating proteins
RhoGAP³ | 46 | 45 | 46 | RhoGAP | Mostly unique

Effectors and other Rho GTPase-binding proteins
Scar complex | 5 | 5 | 5 | PIR121 | WAVE complex
WASP | 1 | 1 | 1 | CRIB | WASP
WASP-related | 2 | 2 | 2 | CRIB | WASP/WAVE (Unique)
PAK⁴ | 8 | 8 | 9 | CRIB | PAK kinases
Gelsolin-related | 1 | 1 | 1 | CRIB |
Formins | 9 | 10 | 10 | GBD | Formins
Formin-related RasGEF | 1 | 1 | 2 | GBD |
IQGAP-related | 4 | 4 | 4 | GRD | IQGAP
PCH family | 3 | 3 | 3 | HR1 | CIP4, Toca-1, Cdc15p
Lipid kinases⁵ | 15 | 16 | 15 | p85 | Class I PI-3-kinases (p110), PI-4-P5K, DGK
Phospholipases⁶ | 6 | 5 | 5 | | Phospholipase C, Phospholipase D1/2
NADPH oxidase | >5 | >5 | >5 | p67phox | NADPH oxidase
Exocyst complex | 8 | 8 | 8 | Sec3, Exo70 | Exocyst complex
LIS1 | 1 | 1 | 1 | | LIS1, Pac1
LimE | 1 | 1 | 1 | | LIM proteins

The table is based on Vlahou and Rivero (2006). Note that for many of the protein families included in the table participation in Rho signaling has been documented in some species but not in others; a role in Amoebozoa should therefore be taken as putative.
¹ Relevant domains or components refer, apart from the GTPase, to those involved in interactions with the Rho GTPase if they have been determined. Abbreviations of the relevant domains: CRIB, Cdc42 and Rac interactive binding (also known as PBD, p21-binding domain); CZH2, CDM-zizimin homology 2 domain (also known as DHR-2, Dock homology region 2); GAP, GTPase activating protein; GBD, Rho GTPase-binding domain of formins; GDI, guanine nucleotide dissociation inhibitor; GEF, guanine nucleotide exchange factor (the RhoGEF domain is also known as DH, Dbl homology domain); GRD, RasGAP-related domain; HR1, protein kinase C-related kinase homology region 1.
² In DD this group includes RacC to E, RacG to Q and the pseudogene RacK. In DF this group includes RacC to E, RacH, RacL, RacP and one more Rac protein without clear affiliation. In PP this group includes RacC to E, RacG, RacH, RacL, RacP and eight more Rac proteins without clear affiliation.
³ Three genes encode proteins with both RhoGAP and RhoGEF domains.
⁴ In DD PakH exists as two copies, one on the chromosome 2 duplication.
⁵ Includes 6 PI3 kinases (p110 subunit), one DAG kinase and 9 (in PP) or 8 (in DD and DF) PI-4-P5 kinase genes. The p85 regulatory subunit of the PI3 kinase, which in vertebrates interacts with Cdc42 and Rac1, is apparently missing in amoebae. This does not exclude a potential regulation by Rho GTPases through other mechanisms; PI 3-kinases have therefore not been excluded from the table.
⁶ Includes one (in DD and PP) or two (in DF) phospholipase C as well as four phospholipase D proteins related to mammalian PLD1 and PLD2.


Figure S5.8: Phylogenetic tree of the Rho family proteins. The 14 DF (in red), the 21 PP (in green) and the 21 DD Rho proteins were incorporated (including a reconstructed sequence of RacK). The phylogenetic tree is based on an alignment of the core GTPase domain devoid of N- and C-terminal repetitive sequences and was constructed using the neighbor joining method with correction for gaps. The scale bar indicates amino acid substitutions per site.

5.6.5 ABC transporters

The ATP-binding cassette (ABC) transporter family is a large group of transmembrane proteins that translocate a broad range of compounds across cellular membranes in all domains of life. While prokaryotes use ABC transporters for both import and export, eukaryotes are considered to use them only for export (Saurin et al. 1999) of natural products, toxic metabolites and xenobiotics. The most intensively studied human ABC transporters are those mediating multi-drug resistance (Chang 2003), highlighting an important role of these proteins in detoxification. Based on the order and number of ATP-binding and transmembrane domains and signature sequences within the ATP-binding domain, the ABC superfamily can be subdivided into eight subfamilies: abcA to abcH (Dean and Allikmets 2001). While most organisms contain several members of most subfamilies, ranging in total from around 20 ABC transporters in protist pathogens to 48 in humans, DD has all subfamilies with up to 24 members per subfamily, in total adding up to 71 proteins (Anjard and Loomis 2002). Only higher plants have twice as many ABC transporters, which reflects their metabolic versatility and the increased dependence of sessile organisms on biocidal defense and attack strategies. The large number of ABC proteins in DD is also considered to reflect its need for chemical warfare with other soil inhabitants (Anjard and Loomis 2002). However, since this is entirely niche-dependent, it is almost impossible to prove such a role for any individual transporter under laboratory conditions. The functions of most ABC transporters in DD are therefore unknown, but two proteins, ABCG2 and ABCG18, have roles in endocytosis (Brazill et al. 2001), while others, such as TagA, TagB and TagC, have acquired crucial roles in developmental signalling. Together with the fourth protein, TagD, these ABC transporters harbour a C-terminal serine protease domain.
TagC is present in the plasma membrane of prestalk cells, which expose the serine protease domain to the cell surface (presumably through the attached transporter) after stimulation with GABA -aminobutryic acid). The protease domain then processes acyl-coenzyme binding protein A (acbA), which is secreted by prespore cells, also in response to GABA stimulation, to form the peptide SDF-2. SDF-2 then triggers spore maturation by binding to the sensor histidine kinase DhkA on prespore cells (Anjard and Loomis 2005; Anjard and Loomis 2006; Wang et al. 1999). TagA and TagB play earlier roles in prestalk and prespore fate determination (Cabral et al. 2006; Good et al. 2003; Shaulsky et al. 1995). An excellent classification was prepared previously by Anjard and coworkers (Anjard and Loomis 2002) of the DD ABC transporters and their phylogenetic relationships to each other and to transporters from other model organisms with sequenced genomes. We therefore focus here on evolutionary changes in ABC transporters within the social amoebas. Through query of the PP and DF genomes with the IPR003439/ PF00005 ABC transporter domain 68 and 64 putative ABC transporter genes were identified in PP and DF, respectively. The sequences corresponding to the ATP-binding cassette were aligned with each other and those of the DD ABC transporters. The alignment was used to construct a molecular phylogeny, which was annotated with the domain architecture of the proteins (Fig. S5.10). Half transporters, which contain only one ABC domain appear once in the tree, while full transporters with two ABC domains appear twice. The phylogeny subdivided all sequences in about 10 clades, with members of the same ABC subfamily generally grouping together in one or two clades. 
The relative order of these clades within the tree was largely unresolved, but grouping of sequences within clades was usually robust, as judged from the high posterior probability values of the nodes and the similar to identical domain architecture of related sequences. For all full transporters, the N-terminal ABC domains of one subfamily are always more related to each other than to the C-terminal domains of the same subfamily, as exemplified by the abcC family, where the N-terminal domains group together in clade 5 and the C-terminal domains at the top of clade 1, and the abcG family, with N-terminal domains in clade 4 and C-terminal domains in clade 3. This reinforces an argument also put forward by Anjard and coworkers (Anjard and Loomis 2002) that if the full transporters were formed by tandem gene duplication of half transporters, these events happened a very long time ago. In fact, there is no evidence that any full transporter evolved from gene duplication within the time span that separates DD, DF and PP from the last common ancestor.

A fair proportion of DD genes has a likely ortholog in both PP and DF, such as abcG3 and G16 in clades 3 and 4, abcG1 and G24 in clades 8 and 9, and abcD2 in clade 5. However, this is more the exception than the rule, because most members, particularly of subfamilies A, C and G, appear to result from group- or species-specific gene family expansions. Remarkably, if a subfamily contains both half transporters and full transporters, the full transporters are much more prone to gene number expansion than the half transporters (see e.g. the A subfamily in clade 2 and the B subfamily in clade 1). This would be expected if the half transporters form heterodimers with another partner. In that case a gene duplication would not result in more functional copies if the partner was not also duplicated. Compulsory heterodimerization of half transporters has been established in yeast, which has only two abcD half transporters (Shani et al. 1996), and would also explain why the phylogeny shows no evidence of recent formation of a full transporter by duplication of a half transporter.

Clade 10 (Fig. S5.10, panel B), which encompasses families E, F and H, contains mostly triplets of likely DD, PP and DF orthologs. All proteins in this group have the ABC cassette but no transmembrane domains, and are unlikely to function as transporters. They may therefore not experience similar selection for gene amplification.

Another point of interest is that the DD triplet TagD, TagC and TagB in clade 1 is represented by only a single gene in PP and DF, which, in phylogenetic analysis based on alignment of either the ABC cassette or the entire protein (Fig. S5.10, panel A), is more similar to TagD than to TagC and TagB. The three genes are present in the order D, C, B as a tandem array in the DD genome, which probably also reflects the temporal order of the duplication events that created these genes. The absence of TagC and TagB in PP and DF suggests that their role in DD spore maturation and cell-type specification may be fairly novel. In the case of TagC's role in spore maturation, this is also borne out by the absence in DF of the SDF-2 target DhkA, which acts downstream of TagC in the sporulation cascade. TagA appears to have lost its serine protease domain in PP, which may also affect the regulation of cell-type specification in this organism.


Figure S5.9: ABC transporters. All sequences containing the InterPro IPR003439 ABC transporter-like ATPase domains were retrieved from the PP, DF and DD genomes. The single or double IPR003439 domain sequences were isolated from the complete proteins and separately aligned using ClustalW. Low complexity insertions were deleted from the alignment, which was used to construct a phylogenetic tree by Bayesian inference (Ronquist and Huelsenbeck 2003). The entire tree, which consists of about 10 clades, is shown as an overview (A). Clades 1-9 (A) and 10 (B) are also shown separately for detailed examination. Sequences are decorated with the domain architectures of the proteins, which are plotted as a generalized architecture per group if they are the same for all members of the group, and individually per sequence if this is not the case. For transporters with two ABC domains, addition of "a" or "b" to the locus tag denotes the N- or C-terminal domain, respectively. For the C-terminal domains, full protein architectures (which duplicate those shown with the N-terminal domains) are presented in washed-out color. Bayesian Inference Posterior Probability (BIPP) values of nodes are indicated by colored dots.
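The domain isolation and labelling described in this legend can be illustrated with a minimal Python sketch. It is not the procedure actually used for the figure: the function name, the 1-based inclusive coordinate convention and the example locus tag are assumptions made for illustration only. Half transporters yield one entry under the plain locus tag, and full transporters yield two entries suffixed "a" (N-terminal) and "b" (C-terminal).

```python
from typing import List, Tuple


def label_abc_domains(
    locus_tag: str,
    protein: str,
    domain_spans: List[Tuple[int, int]],
) -> List[Tuple[str, str]]:
    """Return (label, subsequence) pairs, one per ABC domain hit.

    domain_spans holds 1-based inclusive (start, end) coordinates of the
    ABC domain hits on the protein (hypothetical input format).
    """
    spans = sorted(domain_spans)  # N-terminal domain first
    if len(spans) == 1:
        start, end = spans[0]
        return [(locus_tag, protein[start - 1:end])]
    if len(spans) == 2:
        return [
            (locus_tag + suffix, protein[start - 1:end])
            for suffix, (start, end) in zip("ab", spans)
        ]
    raise ValueError("expected one or two ABC domains per protein")


# Toy protein: residues 11-15 and 26-30 stand in for the two ABC domains.
toy = "M" * 10 + "A" * 5 + "K" * 10 + "C" * 5
print(label_abc_domains("TAG1", toy, [(26, 30), (11, 15)]))
```

Each labelled subsequence would then enter the alignment as a separate sequence, which is why full transporters appear twice in the tree.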

6. Supplemental Methods

6.1 DNA isolation

To avoid contamination with bacterial DNA, which would be preferentially cloned (Glöckner 2000), PP PN500 was grown in axenic medium. DF SH3, which cannot grow axenically, was harvested from bacterial growth plates well after the plates had cleared, and was starved for another 6 h while shaking in phosphate buffer. DNA was isolated from purified nuclei lysed in SDS, and purified by phenol extraction and ethanol precipitation after incubation with proteinase K and RNase A. Only libraries free of bacterial clones, as tested by BLAST search of 96 individual clones against bacterial databases, were used for further analysis.

6.2 Sequencing and Assembly

Plasmid (pUC) based sequencing libraries were constructed as previously described (Glöckner et al. 2002). Fosmid libraries with average insert sizes of 33 kb in vector pCC2FOS were prepared according to the manufacturer's instructions (Epicentre Biotechnologies). DNA sequencing was done with the BigDye kit from ABI using standard forward and reverse primers. Pre-assembly trimming of sequences was performed with a modified version of Phred. The sequencing libraries for the 454/Roche FLX system were prepared according to the manufacturer's protocols. The 454/Roche data were assembled using the Newbler software. All contigs larger than 500 bases were entered in the Staden package database, including the Newbler-derived quality values. The Sanger-based sequencing reads were then added using the gap assembler.

6.3 Chromosome structure analysis

The consensus sequences of the telomere repeats were used to search the whole data set with BLASTn for contigs and single reads containing these repeats. The assembly of all contigs with such repeats was checked manually for wrongly assembled reads, since repetitive sequences tend to cause misassemblies. Consensus sequences of the DD repetitive elements were used as query to find similar sequences in the genomes of the other social amoebae. Regions adjacent to telomeres were searched for other repeated sequence motifs. Detected repeat units were then used as query to find additional identical sequences throughout the genome.

6.4 Gene prediction, Blast and synteny analysis

tRNA genes were predicted using tRNAscan-SE (Lowe and Eddy 1997). For protein coding gene prediction we used geneid (Glöckner et al. 2002; Parra et al. 2000) after training with transcript data obtained by 454/Roche FLX sequencing. Specifically, single transcript reads were aligned to the reference genome sequence using exalign (http://159.149.109.9/exalign/). Intron-exon boundaries were extracted from the resulting alignments and used as a training set for geneid. Furthermore, coding sequences from defined orthologs covered by transcript reads were used as a training set to define coding sequence features.

For the analysis of specific gene families, the predicted gene models were thoroughly examined and corrected where needed and possible. We examined more than 700 genes for this paper. Nearly one third of those are large multidomain genes and therefore more prone to error. We found that 233 of the examined gene models required correction, since they were split or fused genes or contained incorrectly predicted intron/exon boundaries. Thus we would estimate one third of gene models to be incorrect, but since most of the predictions were small, we are confident that our global analysis of gene families and domains is only slightly compromised by these errors.

To define gene families irrespective of direct alignability of the proteins we used orthoMCL with default values. This gave us an overview of the general structure of gene family expansions, but the method yields only a rough clustering of similar sequences. For the definition of orthologs we used Blast analysis of all CDS of one genome against the other, yielding best bidirectional hits (BBH). To define the core proteome of social amoebae, we used a filter threshold for significant hits of 20 % identity between amino acid sequences over at least 50 % of the protein.
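The BBH criterion can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script used for the analysis: the `Hit` record, the use of the bit score to rank hits, and the interpretation of "over at least 50 % of the protein" as a query coverage fraction are assumptions; only the 20 % identity and 50 % coverage thresholds come from the text.

```python
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class Hit(NamedTuple):
    query: str
    subject: str
    identity: float   # percent amino acid identity
    coverage: float   # fraction of the query protein covered (0..1)
    bitscore: float   # used here to rank hits (an assumption)


def best_hits(hits: Iterable[Hit], min_identity: float = 20.0,
              min_coverage: float = 0.5) -> Dict[str, str]:
    """Keep, per query, the top-scoring hit that passes both thresholds."""
    best: Dict[str, Hit] = {}
    for h in hits:
        if h.identity < min_identity or h.coverage < min_coverage:
            continue
        if h.query not in best or h.bitscore > best[h.query].bitscore:
            best[h.query] = h
    return {q: h.subject for q, h in best.items()}


def bidirectional_best_hits(a_vs_b: Iterable[Hit],
                            b_vs_a: Iterable[Hit]) -> Set[Tuple[str, str]]:
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    ab = best_hits(a_vs_b)
    ba = best_hits(b_vs_a)
    return {(a, b) for a, b in ab.items() if ba.get(b) == a}


# Toy data with made-up identifiers.
a_vs_b = [Hit("a1", "b1", 45, 0.9, 200), Hit("a1", "b2", 30, 0.8, 90),
          Hit("a2", "b2", 25, 0.6, 80), Hit("a3", "b3", 18, 0.9, 50)]
b_vs_a = [Hit("b1", "a1", 45, 0.9, 198), Hit("b2", "a1", 30, 0.8, 95),
          Hit("b2", "a2", 25, 0.6, 80), Hit("b3", "a3", 18, 0.9, 50)]
print(bidirectional_best_hits(a_vs_b, b_vs_a))
```

In the toy data, a3/b3 fails the identity filter and a2's best hit b2 prefers a1, so only the a1/b1 pair is reciprocal. This also shows how differential gene family expansions can remove genes from the BBH set.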
A synteny group was defined as a chromosomal segment encoding the same CDS in both species, irrespective of their order. Within a synteny group, orthologous pairs can be separated by as many as five unrelated genes, including tRNAs. Clear orthology relationships are confounded by differential gene family expansions. This is why we compared each protein set to each of the others and obtained slightly differing numbers for shared genes.
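One way to read the synteny definition is as a chaining rule, sketched below in Python. This is a simplified illustration, not the implementation used for the analysis: the input format, the function name and the choice to compare each ortholog only with the previous one in the chain are assumptions. The limit of five intervening genes is taken from the text, and gene order within the segment is ignored by using the absolute distance in the second genome.

```python
from typing import Dict, List, Tuple


def synteny_groups(
    genes_a: List[str],
    ortholog_pos_b: Dict[str, Tuple[str, int]],
    max_gap: int = 5,
) -> List[List[str]]:
    """Chain orthologs that lie close together in both genomes.

    genes_a: gene IDs in chromosomal order on one contig of species A.
    ortholog_pos_b: gene ID in A -> (contig, gene index) of its ortholog in B.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    last_a = last_j = -1
    last_contig = ""
    for i, gene in enumerate(genes_a):
        if gene not in ortholog_pos_b:
            continue  # unrelated gene, counts towards the gap
        contig, j = ortholog_pos_b[gene]
        linked = (
            bool(current)
            and i - last_a - 1 <= max_gap          # gap in species A
            and contig == last_contig
            and abs(j - last_j) - 1 <= max_gap     # gap in species B, any order
        )
        if not linked:
            if len(current) > 1:
                groups.append(current)
            current = []
        current.append(gene)
        last_a, last_contig, last_j = i, contig, j
    if len(current) > 1:
        groups.append(current)
    return groups


# Toy example with made-up gene names; "x" genes have no ortholog.
order_a = ["g1", "x1", "g2", "g3", "x2", "x3", "x4", "x5", "x6", "x7", "g4", "g5"]
pos_b = {"g1": ("c1", 10), "g2": ("c1", 12), "g3": ("c1", 11),
         "g4": ("c1", 13), "g5": ("c2", 1)}
print(synteny_groups(order_a, pos_b))
```

In the toy example g1, g2 and g3 form one group despite the inverted order of g2 and g3 in the second genome, while g4 is excluded because six unrelated genes separate it from g3.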

6.5 Gene family detection using domain analysis

We took into account that not all members of a specific family would be detectable using simple BLAST algorithms, because similarities are obscured by mutations accumulating over the long time periods spanned by the evolution of the species examined. Thus, in all cases, the prospective members of the analysed gene families were retrieved by identifying domain structures using IPRscan (Mulder and Apweiler 2008). In this way we were also able to detect family members obscured by erroneous predictions. Similarities detected by our BBH approach were not counted if a required functional domain could not be detected. PKS domains were detected using available software (SearchPKS (Yadav et al. 2003)).

6.6 Protein variation

For the determination of protein variations we used the orthologous gene set defined by BBHs (see 6.4). dN and dS for the species pairs were determined as described in (Gonzales et al. 2002). The value in Figure S6.1 becomes lower the more non-synonymous substitutions occur in a given protein relative to synonymous exchanges. Thus, Figure S6.1 indicates overall higher divergence between DD and the other two species compared to the DF/PP species pair. On the other hand, the human/hydra comparison indicates a comparable protein evolution between these two different eukaryote clades.

Figure S6.1: (Sd-Nd)/(Sd+Nd) ratios of all proteins with orthologs in species pairs. DD/DF: DD proteins compared to DF orthologs, DD/PP: DD proteins compared to PP orthologs, and DF/PP: DF proteins compared to PP orthologs, respectively. Only true orthologs with a minimum sequence identity of 40% over at least 50% of the gene length were used for calculation. x axis: ratios from -1 to +1; y axis: number of protein pairs. The comparison of the orthologous Human/Hydra pairs with the same threshold parameters is depicted as dashed line. The secondary peak observed in this curve might be due to paralogous sequences, which are presumably derived from a genome duplication event in the animal lineage (Edger and Pires 2009). The calculation of the ratios was done using the method described in (Gonzales et al. 2002).
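The ratio plotted in Figure S6.1 can be written out as a short Python sketch. It assumes that Sd and Nd are the synonymous and non-synonymous difference counts reported per ortholog pair by the method of Gonzales et al. (2002); the function names and the choice of 20 bins are illustrative assumptions, not taken from the original analysis.

```python
from typing import Iterable, List


def sd_nd_ratio(sd: float, nd: float) -> float:
    """(Sd - Nd) / (Sd + Nd): +1 if all differences are synonymous,
    -1 if all are non-synonymous."""
    total = sd + nd
    if total == 0:
        raise ValueError("no differences observed; ratio undefined")
    return (sd - nd) / total


def ratio_histogram(ratios: Iterable[float], n_bins: int = 20) -> List[int]:
    """Count ratios in equal-width bins spanning -1 to +1 (x axis of the figure)."""
    counts = [0] * n_bins
    for r in ratios:
        index = min(int((r + 1.0) / 2.0 * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


# Toy values: more non-synonymous differences give a lower ratio.
print(sd_nd_ratio(30, 10), sd_nd_ratio(10, 30))
```

A protein pair with many non-synonymous differences therefore falls further to the left of the distribution, which is the sense in which the value "becomes lower" with increasing protein divergence.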

6.7 Phylogeny and species split dating

We established the orthology relationships between proteins of social amoebae (PP, DF, DD), plants (Arabidopsis thaliana, Physcomitrella patens), Metazoa (Hydra magnipapillata, Homo sapiens), and fungi (Saccharomyces cerevisiae, Schizosaccharomyces pombe) by using best reciprocal blast hits. Out of the orthologous proteins we randomly chose 33 for further phylogenetic analysis. The 33 protein sequences were concatenated and then aligned using ClustalW. Using MEGA we reconstructed a neighbor joining phylogenetic tree (pairwise gap deletion, Poisson correction, bootstrap test with 1000 replicates; Fig. 5B, main text). The root of the tree was set between plants and all other species. Bayesian inference yielded the same tree topology as the neighbor joining method. We calibrated the tree using an estimated divergence time of 750 mya for the Hydra/human split. For the plant split this yields a divergence time of 561 mya, for the fungal split 926 mya, and for the social amoebae split 597 mya. The DF/PP split was placed at 494 mya. The data for the plants, Metazoa and fungi are in good agreement with previous estimates (Hedges 2002). Thus, the split between the social amoebae is likely to have occurred before the divergence of the plants, shortly before or during the Cambrian explosion.
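Two of the steps named above can be made concrete with a small Python sketch: the Poisson-corrected distance with pairwise gap deletion, and the scaling of node depths to the 750 mya calibration point. The Poisson correction d = -ln(1 - p) is the standard formula; the linear scaling assumes a constant substitution rate along the tree, which is an assumption of this sketch and not a statement of how the dating was actually computed. Function names and the toy inputs are hypothetical.

```python
import math


def poisson_distance(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Poisson-corrected amino acid distance, d = -ln(1 - p).

    Pairwise deletion: only alignment columns with a gap in one of
    these two sequences are skipped.
    """
    compared = differences = 0
    for x, y in zip(seq_a, seq_b):
        if x == gap or y == gap:
            continue
        compared += 1
        differences += x != y
    if compared == 0:
        raise ValueError("no ungapped columns shared by the two sequences")
    p = differences / compared  # proportion of differing sites
    if p >= 1.0:
        raise ValueError("sequences differ at every site; distance undefined")
    return -math.log(1.0 - p)


def calibrated_age(node_depth: float, calibration_depth: float,
                   calibration_age_mya: float = 750.0) -> float:
    """Scale a node depth to time, assuming a constant rate (illustrative)."""
    return node_depth / calibration_depth * calibration_age_mya


# Toy aligned sequences: 4 ungapped columns, 1 difference, p = 0.25.
print(poisson_distance("ACDE-", "ACDFG"))
# Hypothetical depths: a node at 0.4 when the calibration node is at 0.5.
print(calibrated_age(0.4, 0.5))
```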

Table S6.1: Accessions of proteins used for the phylogenetic reconstruction

Physcomitrella  Arabidopsis  budding_yeast  schizo_yeast  human  Hydra  PP  DF  DD
jgi|Phypa1_1|226216|estExt_Genewise1.C_3510021  gi|15230534|ref|NP_187864.1|  gi|6320950|ref|NP_011029.1|  SPCC1739.13  gid:274866  gi|221124574|ref|XP_002166887.1|  PPL_00167  DFA_05768  DDB0185047
jgi|Phypa1_1|197071|estExt_gwp_gw1.C_2820013  gi|15237195|ref|NP_200650.1|  gi|6322605|ref|NP_012679.1|  SPBC19F8.08  gid:160068  gi|221110073|ref|XP_002154886.1|  PPL_06376  DFA_10200  DDB0185063
jgi|Phypa1_1|112228|e_gw1.3.315.1  gi|79326317|ref|NP_001031791.1|  gi|51830259|gb|AAU09703.1|  SPCC757.07c  gid:264452  gi|221124822|ref|XP_002166319.1|  PPL_04263  DFA_07977  DDB0185123
jgi|Phypa1_1|210224|estExt_Genewise1.C_600049  gi|22326813|ref|NP_196968.2|  gi|45269878|gb|AAS56320.1|  SPAC1687.15  gid:155451  gi|221122867|ref|XP_002156819.1|  PPL_02520  DFA_06870  DDB0185150
jgi|Phypa1_1|85185|fgenesh1_pg.scaffold_131000085  gi|18402435|ref|NP_565706.1|  gi|6324644|ref|NP_014713.1|  SPBC530.01  gid:135153  gi|221119226|ref|XP_002164405.1|  PPL_06220  DFA_02906  DDB0190714
jgi|Phypa1_1|144596|e_gw1.217.6.1  gi|18414193|ref|NP_568114.1|  gi|6320077|ref|NP_010157.1|  SPAC1565.08  gid:255256  gi|221102685|ref|XP_002168499.1|  PPL_08214  DFA_02539  DDB0191154
jgi|Phypa1_1|180534|estExt_gwp_gw1.C_430044  gi|18397283|ref|NP_564338.1|  gi|6322569|ref|NP_012643.1|  SPBC215.08c  gid:102657  gi|221104835|ref|XP_002168345.1|  PPL_05767  DFA_05982  DDB0201646
jgi|Phypa1_1|160797|estExt_fgenesh1_pg.C_260074  gi|79321186|ref|NP_001031270.1|  gi|6319281|ref|NP_009364.1|  SPAC9.07c  gid:276945  gi|221116249|ref|XP_002154991.1|  PPL_12499  DFA_08181  DDB0205772
jgi|Phypa1_1|193916|estExt_gwp_gw1.C_2070079  gi|15219669|ref|NP_171913.1|  gi|6320863|ref|NP_010942.1|  SPBC17G9.09  gid:95350  gi|221116731|ref|XP_002162239.1|  PPL_02462  DFA_07552  DDB0215353
jgi|Phypa1_1|130413|e_gw1.82.16.1  gi|30689768|ref|NP_850420.1|  gi|6323028|ref|NP_013100.1|  SPBC12C2.08  gid:195976  gi|221116393|ref|XP_002164260.1|  PPL_01809  DFA_08103  DDB0215390
jgi|Phypa1_1|227152|estExt_Genewise1.C_4000009  gi|22330310|ref|NP_683443.1|  gi|6321315|ref|NP_011392.1|  SPCC576.08c  gid:263316  gi|221132992|ref|XP_002164951.1|  PPL_03716  DFA_00370  DDB0215391
jgi|Phypa1_1|143852|e_gw1.209.8.1  gi|30695816|ref|NP_850742.1|  gi|14719772|pdb|1F9T|A  SPAC1834.07  gid:217617  gi|221124704|ref|XP_002158951.1|  PPL_02778  DFA_11546  DDB0216174
jgi|Phypa1_1|179651|estExt_gwp_gw1.C_370027  gi|18406515|ref|NP_566016.1|  gi|6324328|ref|NP_014398.1|  SPAC6C3.04  gid:90009  gi|221108566|ref|XP_002163668.1|  PPL_03366  DFA_05444  DDB0220638
jgi|Phypa1_1|136290|e_gw1.127.10.1  gi|18407829|ref|NP_564814.1|  gi|6321017|ref|NP_011096.1|  SPAC1002.05c  gid:274532  gi|221112064|ref|XP_002153783.1|  PPL_11710  DFA_11947  DDB0220639
jgi|Phypa1_1|129584|e_gw1.77.5.1  gi|18422029|ref|NP_198898.2|  gi|6319612|ref|NP_009694.1|  SPBC216.05  gid:271004  gi|221115121|ref|XP_002160018.1|  PPL_11229  DFA_05791  DDB0229332
jgi|Phypa1_1|95619|fgenesh1_pg.scaffold_283000055  gi|15233349|ref|NP_195308.1|  gi|6323335|ref|NP_013407.1|  SPAC24C9.06c  gid:88046  gi|221129169|ref|XP_002165831.1|  PPL_12255  DFA_03121  DDB0229908
jgi|Phypa1_1|198161|estExt_gwp_gw1.C_3160005  gi|15227588|ref|NP_181158.1|  gi|547604|emb|CAA54769.1|  SPBC18H10.13  gid:100527  gi|221107701|ref|XP_002154189.1|  PPL_04103  DFA_01966  DDB0230056
jgi|Phypa1_1|59475|fgenesh1_pm.scaffold_169000019  gi|15218940|ref|NP_176198.1|  gi|1070438|pir||DEBYPA  SPAC26F1.03  gid:197689  gi|221129918|ref|XP_002162893.1|  PPL_00689  DFA_10174  DDB0230193
jgi|Phypa1_1|214065|estExt_Genewise1.C_970070  gi|18404552|ref|NP_566767.1|  gi|6325126|ref|NP_015194.1|  SPBC11C11.09c  gid:287171  gi|221090919|ref|XP_002156632.1|  PPL_07225  DFA_08540  DDB0231242
jgi|Phypa1_1|227167|estExt_Genewise1.C_4000028  gi|30690426|ref|NP_198035.2|  gi|14318437|ref|NP_116578.1|  SPBC25H2.02  gid:87941  gi|221122393|ref|XP_002164939.1|  PPL_12508  DFA_10984  DDB0231248
jgi|Phypa1_1|185556|estExt_gwp_gw1.C_840085  gi|15223829|ref|NP_172913.1|  gi|6321531|ref|NP_011608.1|  SPBC1709.02c  gid:86734  gi|221121176|ref|XP_002162217.1|  PPL_10017  DFA_02164  DDB0231269
jgi|Phypa1_1|193492|estExt_gwp_gw1.C_1990034  gi|18414179|ref|NP_568113.1|  gi|6324709|ref|NP_014779.1|  SPAC11G7.03  gid:89099  gi|221119080|ref|XP_002167502.1|  PPL_08952  DFA_05642  DDB0231288
jgi|Phypa1_1|113773|e_gw1.7.88.1  gi|15228692|ref|NP_191771.1|  gi|6321808|ref|NP_011884.1|  SPBC19C7.06  gid:87973  gi|221124836|ref|XP_002167068.1|  PPL_10600  DFA_03180  DDB0231317
jgi|Phypa1_1|104557|estExt_fgenesh1_pm.C_90010  gi|15225353|ref|NP_179632.1|  gi|6321683|ref|NP_011760.1|  SPCC1620.08  gid:185797  gi|221120051|ref|XP_002157521.1|  PPL_02931  DFA_06056  DDB0231358
jgi|Phypa1_1|181428|estExt_gwp_gw1.C_490093  gi|145331433|ref|NP_001078075.1|  gi|6324993|ref|NP_015061.1|  SPCC18.18c  gid:257297  gi|221103278|ref|XP_002161201.1|  PPL_05263  DFA_10203  DDB0231400
jgi|Phypa1_1|131669|e_gw1.90.117.1  gi|15242863|ref|NP_201173.1|  gi|6324923|ref|NP_014992.1|  SPAC4H3.10c  gid:250776  gi|221104593|ref|XP_002165702.1|  PPL_01838  DFA_01601  DDB0231421
jgi|Phypa1_1|198340|estExt_gwp_gw1.C_3230009  gi|15225553|ref|NP_181507.1|  gi|6320148|ref|NP_010228.1|  SPCC1906.01  gid:117721  gi|221112728|ref|XP_002163620.1|  PPL_06240  DFA_12310  DDB0231665
jgi|Phypa1_1|1048|gw1.243.14.1  gi|30681987|ref|NP_172562.2|  gi|6323844|ref|NP_013915.1|  SPAC2G11.12  gid:130420  gi|221126273|ref|XP_002164412.1|  PPL_07301  DFA_10088  DDB0233085
jgi|Phypa1_1|205535|estExt_Genewise1.C_280123  gi|15218086|ref|NP_173520.1|  gi|6321020|ref|NP_011099.1|  SPAC9.03c  gid:235646  gi|221128221|ref|XP_002165482.1|  PPL_00021  DFA_02078  DDB0233131
jgi|Phypa1_1|219762|estExt_Genewise1.C_1810001  gi|30693962|ref|NP_851119.1|  gi|407521|emb|CAA81523.1|  SPCC1739.13  gid:276111  gi|221091261|ref|XP_002161172.1|  PPL_06996  DFA_10906  DDB0233663
jgi|Phypa1_1|106184|estExt_fgenesh1_pm.C_650013  gi|30690306|ref|NP_182152.2|  gi|6323795|ref|NP_013866.1|  SPAC4D7.05  gid:144229  gi|221129408|ref|XP_002167671.1|  PPL_11335  DFA_00930  DDB0233910
jgi|Phypa1_1|146718|e_gw1.247.8.1  gi|42563275|ref|NP_177807.3|  gi|6319282|ref|NP_009365.1|  SPAC56F8.03  gid:75871  gi|221121018|ref|XP_002155134.1|  PPL_06786  DFA_10229  DDB0234259
jgi|Phypa1_1|131457|e_gw1.88.50.1  gi|15223458|ref|NP_176007.1|  gi|6323226|ref|NP_013298.1|  SPBC646.10c  gid:203585  gi|221103067|ref|XP_002156048.1|  PPL_06237  DFA_02824  DDB0237576

7. References

Alvarez-Curto, E., D.E. Rozen, A.V. Ritchie, C. Fouquet, S.L. Baldauf, and P. Schaap. 2005. Evolutionary origin of cAMP-based chemoattraction in the social amoebae. Proc Natl Acad Sci U S A 102: 6385-6390.
Anjard, C. and W.F. Loomis. 2002. Evolutionary analyses of ABC transporters of Dictyostelium discoideum. Eukaryot Cell 1: 643-652.
Anjard, C. and W.F. Loomis. 2005. Peptide signaling during terminal differentiation of Dictyostelium. Proc Natl Acad Sci U S A 102: 7607-7611.
Anjard, C. and W.F. Loomis. 2006. GABA induces terminal differentiation of Dictyostelium through a GABAB receptor. Development 133: 2253-2261.
Anjard, C. and W.F. Loomis. 2008. Cytokinins induce sporulation in Dictyostelium. Development 135: 819-827.
Aubry, L. and R. Firtel. 1999. Integration of signaling networks that regulate Dictyostelium differentiation. Annu Rev Cell Dev Biol 15: 469-517.
Austin, M.B., T. Saito, M.E. Bowman, S. Haydock, A. Kato, B.S. Moore, R.R. Kay, and J.P. Noel. 2006. Biosynthesis of Dictyostelium discoideum differentiation-inducing factor by a hybrid type I fatty acid-type III polyketide synthase. Nat Chem Biol 2: 494-502.
Bader, S., A. Kortholt, and P.J. Van Haastert. 2007. Seven Dictyostelium discoideum phosphodiesterases degrade three pools of cAMP and cGMP. Biochem J 402: 153-161.
Bauer, P.O. and N. Nukina. 2009. The pathogenic mechanisms of polyglutamine diseases and current therapeutic strategies. J Neurochem 110: 1737-1765.
Benabentos, R., S. Hirose, R. Sucgang, T. Curk, M. Katoh, E.A. Ostrowski, J.E. Strassmann, D.C. Queller, B. Zupan, G. Shaulsky, and A. Kuspa. 2009. Polymorphic members of the lag gene family mediate kin discrimination in Dictyostelium. Curr Biol 19: 567-572.
Bhutkar, A., S.W. Schaeffer, S.M. Russo, M. Xu, T.F. Smith, and W.M. Gelbart. 2008. Chromosomal rearrangement inferred from comparisons of 12 Drosophila genomes. Genetics 179: 1657-1680.
Blomberg, P., C. Randolph, C.H. Yao, and M.C. Yao. 1997. Regulatory sequences for the amplification and replication of the ribosomal DNA minichromosome in Tetrahymena thermophila. Mol Cell Biol 17: 7237-7247.
Bockaert, J. and J.P. Pin. 1999. Molecular tinkering of G protein-coupled receptors: an evolutionary success. Embo J 18: 1723-1729.
Bosgraaf, L., A. Waijer, R. Engel, A.J. Visser, D. Wessels, D. Soll, and P.J. van Haastert. 2005. RasGEF-containing proteins GbpC and GbpD have differential effects on cell polarity and chemotaxis in Dictyostelium. J Cell Sci 118: 1899-1910.
Brazill, D.T., L.R. Meyer, R.D. Hatton, D.A. Brock, and R.H. Gomer. 2001. ABC transporters required for endocytosis and endosomal pH regulation in Dictyostelium. J Cell Sci 114: 3923-3932.
Cabral, M., C. Anjard, W.F. Loomis, and A. Kuspa. 2006. Genetic evidence that the acyl coenzyme A binding protein AcbA and the serine protease/ABC transporter TagA function together in Dictyostelium discoideum cell differentiation. Eukaryot Cell 5: 2024-2032.
Chang, G. 2003. Multidrug resistance ABC transporters. FEBS Lett 555: 102-105.
Chenna, R., H. Sugawara, T. Koike, R. Lopez, T.J. Gibson, D.G. Higgins, and J.D. Thompson. 2003. Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res 31: 3497-3500.
Cornillon, S., L. Gebbie, M. Benghezal, P. Nair, S. Keller, B. Wehrle-Haller, S.J. Charette, F. Bruckert, F. Letourneur, and P. Cosson. 2006. An adhesion molecule in free-living Dictyostelium amoebae with integrin beta features. EMBO Rep 7: 617-621.
Cox, E.C., C.D. Vocke, S. Walter, K.Y. Gregg, and E.S. Bain. 1990. Electrophoretic karyotype for Dictyostelium discoideum. Proc Natl Acad Sci U S A 87: 8247-8251.
Dean, M. and R. Allikmets. 2001. Complete characterization of the human ABC gene family. J Bioenerg Biomembr 33: 475-479.
Edger, P.P. and J.C. Pires. 2009. Gene and genome duplications: the impact of dosage-sensitivity on the fate of nuclear genes. Chromosome Res 17: 699-717.
Eichinger, L., J.A. Pachebat, G. Glöckner, M.A. Rajandream, R. Sucgang, M. Berriman, J. Song, R. Olsen, K. Szafranski, Q. Xu, B. Tunggal, S. Kummerfeld, M. Madera, B.A. Konfortov, F. Rivero, A.T. Bankier, R. Lehmann, N. Hamlin, R. Davies, P. Gaudet, P. Fey, K. Pilcher, G. Chen, D. Saunders, E. Sodergren, P. Davis, A. Kerhornou, X. Nie, N. Hall, C. Anjard, L. Hemphill, N. Bason, P. Farbrother, B. Desany, E. Just, T. Morio, R. Rost, C. Churcher, J. Cooper, S. Haydock, N. van Driessche, A. Cronin, I. Goodhead, D. Muzny, T. Mourier, A. Pain, M. Lu, D. Harper, R. Lindsay, H. Hauser, K. James, M. Quiles, M. Madan Babu, T. Saito, C. Buchrieser, A. Wardroper, M. Felder, M. Thangavelu, D. Johnson, A. Knights, H. Loulseged, K. Mungall, K. Oliver, C. Price, M.A. Quail, H. Urushihara, J. Hernandez, E. Rabbinowitsch, D. Steffen, M. Sanders, J. Ma, Y. Kohara, S. Sharp, M. Simmonds, S. Spiegler, A. Tivey, S. Sugano, B. White, D. Walker, J. Woodward, T. Winckler, Y. Tanaka, G. Shaulsky, M. Schleicher, G. Weinstock, A. Rosenthal, E.C. Cox, R.L. Chisholm, R. Gibbs, W.F. Loomis, M. Platzer, R.R. Kay, J. Williams, P.H. Dear, A.A. Noegel, B. Barrell, and A. Kuspa. 2005. The genome of the social amoeba Dictyostelium discoideum. Nature 435: 43-57.
Fey, P., S. Stephens, M.A. Titus, and R.L. Chisholm. 2002. SadA, a novel adhesion receptor in Dictyostelium. J Cell Biol 159: 1109-1119.
Franke, J., M. Faure, L. Wu, A.L. Hall, G.J. Podgorski, and R.H. Kessin. 1991. Cyclic nucleotide phosphodiesterase of Dictyostelium discoideum and its glycoprotein inhibitor: structure and expression of their genes. Dev Genet 12: 104-112.
Friedberg, F. and F. Rivero. 2010. Single and multiple CH (calponin homology) domain containing multidomain proteins in Dictyostelium discoideum: an inventory. Mol Biol Rep 37: 2853-2862.
Glöckner, G. 2000. Large Scale Sequencing and Analysis of AT Rich Eukaryote Genomes. Curr. Genomics 1: 289-299.
Glöckner, G., L. Eichinger, K. Szafranski, J.A. Pachebat, A.T. Bankier, P.H. Dear, R. Lehmann, C. Baumgart, G. Parra, J.F. Abril, R. Guigo, K. Kumpf, B. Tunggal, E. Cox, M.A. Quail, M. Platzer, A. Rosenthal, and A.A. Noegel. 2002. Sequence and analysis of chromosome 2 of Dictyostelium discoideum. Nature 418: 79-85.
Gonzales, M.J., J.M. Dugan, and R.W. Shafer. 2002. Synonymous-non-synonymous mutation rates between sequences containing ambiguous nucleotides (Syn-SCAN). Bioinformatics 18: 886-887.
Good, J.R., M. Cabral, S. Sharma, J. Yang, N. Van Driessche, C.A. Shaw, G. Shaulsky, and A. Kuspa. 2003. TagA, a putative serine protease/ABC transporter of Dictyostelium that is required for cell fate determination at the onset of development. Development 130: 2953-2965.
Grimson, M.J., J.C. Coates, J.P. Reynolds, M. Shipman, R.L. Blanton, and A.J. Harwood. 2000. Adherens junctions and beta-catenin-mediated cell signalling in a non-metazoan organism. Nature 408: 727-731.
Hedges, S.B. 2002. The origin and evolution of model organisms. Nat Rev Genet 3: 838-849.
Hughes, R.E., R.S. Lo, C. Davis, A.D. Strand, C.L. Neal, J.M. Olson, and S. Fields. 2001. Altered transcription in yeast expressing expanded polyglutamine. Proc Natl Acad Sci U S A 98: 13201-13206.

John, U., B. Beszteri, E. Derelle, Y. Van de Peer, B. Read, H. Moreau, and A. Cembella. 2008. Novel Insights into Evolution of Protistan Polyketide Synthases through Phylogenomic Analysis. Protist 159: 21-30.
Joseph, J.M., P. Fey, N. Ramalingam, X.I. Liu, M. Rohlfs, A.A. Noegel, A. Muller-Taubenberger, G. Glöckner, and M. Schleicher. 2008. The actinome of Dictyostelium discoideum in comparison to actins and actin-related proteins from other organisms. PLoS One 3: e2654.
Kawabe, Y., T. Morio, J.L. James, A.R. Prescott, Y. Tanaka, and P. Schaap. 2009. Activated cAMP receptors switch encystation into sporulation. Proc Natl Acad Sci U S A 106: 7089-7094.
Kirsten, J.H., Y. Xiong, C.T. Davis, and C.K. Singleton. 2008. Subcellular localization of ammonium transporters in Dictyostelium discoideum. BMC Cell Biol 9: 71.
Kollmar, M. 2006. Thirteen is enough: the myosins of Dictyostelium discoideum and their light chains. BMC Genomics 7: 183.
Kollmar, M. and G. Glöckner. 2003. Identification and phylogenetic analysis of Dictyostelium discoideum kinesins. BMC Genomics 4: 47.
Lee, S., F.I. Comer, A. Sasaki, I.X. McLeod, Y. Duong, K. Okumura, J.R. Yates, 3rd, C.A. Parent, and R.A. Firtel. 2005. TOR complex 2 integrates cell movement during chemotaxis and signal relay in Dictyostelium. Mol Biol Cell 16: 4572-4583.
Li, L., C.J. Stoeckert, and D.S. Roos. 2003. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res 13: 2178-2189.
Lin, Z., S. Sriskanthadevan, H. Huang, C.H. Siu, and D. Yang. 2006. Solution structures of the adhesion molecule DdCAD-1 reveal new insights into Ca(2+)-dependent cell-cell adhesion. Nat Struct Mol Biol 13: 1016-1022.
Louis, J.M., G.T. Ginsburg, and A.R. Kimmel. 1994. The cAMP receptor CAR4 regulates axial patterning and cellular differentiation during late development of Dictyostelium. Genes Dev 8: 2086-2096.
Lowe, T.M. and S.R. Eddy. 1997. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic. Nucleic Acids Res 25: 955-964.
Meima, M.E., R.M. Biondi, and P. Schaap. 2002. Identification of a novel type of cGMP phosphodiesterase that is defective in the chemotactic stmF mutants. Mol Biol Cell 13: 3870-3877.
Meima, M.E., K.E. Weening, and P. Schaap. 2003. Characterization of a cAMP-stimulated cAMP phosphodiesterase in Dictyostelium discoideum. J Biol Chem 278: 14356-14362.
Mulder, N.J. and R. Apweiler. 2008. The InterPro database and tools for analysis. Curr Protoc Bioinformatics Chapter 2: Unit 2 7.
Needleman, S.B. and C.D. Wunsch. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48: 443-453.
Ott, A., F. Oehme, H. Keller, and S.C. Schuster. 2000. Osmotic stress response in Dictyostelium is mediated by cAMP. Embo J 19: 5782-5792.
Pandey, S., D.C. Nelson, and S.M. Assmann. 2009. Two novel GPCR-type G proteins are abscisic acid receptors in Arabidopsis. Cell 136: 136-148.
Parra, G., E. Blanco, and R. Guigo. 2000. GeneID in Drosophila. Genome Res 10: 511-515.
Prabhu, Y. and L. Eichinger. 2006. The Dictyostelium repertoire of seven transmembrane domain receptors. Eur J Cell Biol 85: 937-946.
Ritchie, A.V., S. van Es, C. Fouquet, and P. Schaap. 2008. From drought sensing to developmental control: evolution of cyclic AMP signaling in social amoebas. Mol Biol Evol 25: 2109-2118.
Rivero, F. and L. Eichinger. 2005. The microfilament system of Dictyostelium discoideum. In Dictyostelium Genomics (eds. W.F. Loomis and A. Kuspa), pp. 125-171. Horizon Bioscience, Norfolk.

Ronquist, F. and J.P. Huelsenbeck. 2003. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19: 1572-1574.
Saito, T., A. Kato, and R.R. Kay. 2008. DIF-1 induces the basal disc of the Dictyostelium fruiting body. Dev Biol 317: 444-453.
Saito, T., G.W. Taylor, J.C. Yang, D. Neuhaus, D. Stetsenko, A. Kato, and R.R. Kay. 2006. Identification of new differentiation inducing factors from Dictyostelium discoideum. Biochim Biophys Acta 1760: 754-761.
Saran, S., M.E. Meima, E. Alvarez-Curto, K.E. Weening, D.E. Rozen, and P. Schaap. 2002. cAMP signaling in Dictyostelium. Complexity of cAMP synthesis, degradation and detection. J Muscle Res Cell Motil 23: 793-802.
Saurin, W., M. Hofnung, and E. Dassa. 1999. Getting in or out: early segregation between importers and exporters in the evolution of ATP-binding cassette (ABC) transporters. J Mol Evol 48: 22-41.
Schuster, S.C., A.A. Noegel, F. Oehme, G. Gerisch, and M.I. Simon. 1996. The hybrid histidine kinase DokA is part of the osmotic response system of Dictyostelium. Embo J 15: 3880-3889.
Schwarzacher, H.G. and F. Wachtler. 1993. The nucleolus. Anat Embryol (Berl) 188: 515-536.
Serafimidis, I. and R.R. Kay. 2005. New prestalk and prespore inducing signals in Dictyostelium. Dev Biol 282: 432-441.
Sesaki, H., E.F. Wong, and C.H. Siu. 1997. The cell adhesion molecule DdCAD-1 in Dictyostelium is targeted to the cell surface by a nonclassical transport pathway involving contractile vacuoles. J Cell Biol 138: 939-951.
Shani, N., A. Sapag, P.A. Watkins, and D. Valle. 1996. An S. cerevisiae peroxisomal transporter, orthologous to the human adrenoleukodystrophy protein, appears to be a heterodimer of two half ABC transporters: Pxa1p and Pxa2p. Ann N Y Acad Sci 804: 770-772.
Shaulsky, G., A. Kuspa, and W.F. Loomis. 1995. A multidrug resistance transporter/serine protease gene is required for prestalk specialization in Dictyostelium. Genes Dev 9: 1111-1122.
Singleton, C.K., M.J. Zinda, B. Mykytka, and P. Yang. 1998. The histidine kinase dhkC regulates the choice between migrating slugs and terminal differentiation in Dictyostelium discoideum. Dev Biol 203: 345-357.
Skowyra, D., K.L. Craig, M. Tyers, S.J. Elledge, and J.W. Harper. 1997. F-box proteins are receptors that recruit phosphorylated substrates to the SCF ubiquitin-ligase complex. Cell 91: 209-219.
Stadler, J., T.W. Keenan, G. Bauer, and G. Gerisch. 1989. The contact site A glycoprotein of Dictyostelium discoideum carries a phospholipid anchor of a novel type. Embo J 8: 371-377.
Sucgang, R., G. Chen, W. Liu, R. Lindsay, J. Lu, D. Muzny, G. Shaulsky, W. Loomis, R. Gibbs, and A. Kuspa. 2003. Sequence and structure of the extrachromosomal palindrome encoding the ribosomal RNA genes in Dictyostelium. Nucleic Acids Res 31: 2361-2368.
Szafranski, K., N. Jahn, and M. Platzer. 2006. tuple_plot: Fast pairwise nucleotide sequence comparison with noise suppression. Bioinformatics 22: 1917-1918.
Thomason, P.A., S. Sawai, J.B. Stock, and E.C. Cox. 2006. The histidine kinase homologue DhkK/Sombrero controls morphogenesis in Dictyostelium. Dev Biol 292: 358-370.
Thomason, P.A., D. Traynor, G. Cavet, W.T. Chang, A.J. Harwood, and R.R. Kay. 1998. An intersection of the cAMP/PKA and two-component signal transduction systems in Dictyostelium. Embo J 17: 2838-2845.
van Egmond, W.N., A. Kortholt, K. Plak, L. Bosgraaf, S. Bosgraaf, I. Keizer-Gunnink, and P.J. van Haastert. 2008. Intramolecular activation mechanism of the Dictyostelium LRRK2 homolog Roco protein GbpC. J Biol Chem 283: 30412-30420.
Veltman, D.M., I. Keizer-Gunnik, and P.J. Van Haastert. 2008. Four key signaling pathways mediating chemotaxis in Dictyostelium discoideum. J Cell Biol 180: 747-753.
Vlahou, G. and F. Rivero. 2006. Rho GTPase signaling in Dictyostelium discoideum: insights from the genome. Eur J Cell Biol 85: 947-959.
Wang, N., F. Soderbom, C. Anjard, G. Shaulsky, and W.F. Loomis. 1999. SDF-2 induction of terminal differentiation in Dictyostelium discoideum is mediated by the membrane-spanning sensor kinase DhkA. Mol Cell Biol 19: 4750-4756.
Wolanin, P.M., P.A. Thomason, and J.B. Stock. 2002. Histidine protein kinases: key signal transducers outside the animal kingdom. Genome Biol 3: REVIEWS3013.
Wong, E.F., S.K. Brar, H. Sesaki, C. Yang, and C.H. Siu. 1996. Molecular cloning and characterization of DdCAD-1, a Ca2+-dependent cell-cell adhesion molecule, in Dictyostelium discoideum. J Biol Chem 271: 16399-16408.
Yadav, G., R.S. Gokhale, and D. Mohanty. 2003. SEARCHPKS: A program for detection and analysis of polyketide synthase domains. Nucleic Acids Res 31: 3654-3658.
Yamada, Y., H.Y. Wang, M. Fukuzawa, G.J. Barton, and J.G. Williams. 2008. A new family of transcription factors. Development 135: 3093-3101.
Zinda, M.J. and C.K. Singleton. 1998. The hybrid histidine kinase dhkB regulates spore germination in Dictyostelium discoideum. Dev Biol 196: 171-183.
Zucko, J., N. Skunca, T. Curk, B. Zupan, P.F. Long, J. Cullum, R.H. Kessin, and D. Hranueli. 2007. Polyketide synthase genes and the natural products potential of Dictyostelium discoideum. Bioinformatics 23: 2543-2549.
