Supporting Information

Chang et al. 10.1073/pnas.1511468112 SI Materials and Methods species and 200 ribosomal and nonribosomal protein genes (51,940 Specimen Collection. Locality information is given in Table S1. amino acids); and (iv) a dataset that includes 30 cnidarian species Actinospores of M. cerebralis (kindly provided by R. Hedrick, and 128 nonribosomal protein genes (41,237 amino acids). Addi- University of California, Davis, CA) were collected and flash- tional analyses were also performed excluding either Myxozoa or frozen as they emerged from the annelid host (Fig. 1). For P. hydriforme. For all datasets, Bayesian tree reconstructions were K. iwatai, plasmodia were collected from the fish host. K. iwatai conducted under the CAT model (34) as implemented in Phylo- plasmodia form pseudocysts encapsulated by host cells. Inside bayes MPI vs.1.5 (53). For the third dataset, the CAT-GTR model, the plasmodia, the myxospores are at various stages of matura- which is more computationally intensive, was also used. For the tion (64). Because our RNA extractions were based on several analyses of datasets 1 and 3, two independent chains were run for cysts pooled together, we can assume that our RNA data rep- 10,000 cycles, and trees were saved every 10 cycles. The first 2,000 resent all Kudoa life stages present in the fish. Unfortunately, the trees were discarded (burn-in). For the analysis of the second annelid host of K. iwatai is unknown. P. hydriforme was collected dataset, the two chains were run for 6,000 generations and sampled and flash-frozen 3–5 d after emerging from the host’s oocytes, every 10 trees, and the first 2,000 trees were discarded. For the after it has fragmented into free-living individuals (Fig. 1). analysis of the fourth dataset, the two chains were run for 20,000 generations and sampled every 10 trees, and the first 5,000 trees Host Contaminant Filtering. To filter the assemblies from fish con- were discarded. Chain convergence was evaluated with the bpcomp taminants, genomic sequences were obtained for Sparus aurata and tracecomp programs of the Phylobayes software. The maximum (E. leei and K. iwatai host) and Pterois miles (S. zaharoni host) as and average differences observed at the end of each run were lower described (65, 66). Blast searches were conducted to eliminate con- than 0.0005 for all analyses. Similarly, the effsize and rel_diff pa- taminating fish sequences from the genomic assemblies. Specifically rameters were always higher than 30 and lower than 0.3, re- BLASTN (version 2.2.27+) searches were performed for each of spectively, which indicates a correct chain convergence because our the three myxozoans, using the genomic assembly sequences as analyses investigate the topology rather than branch length and all query against a database of their respective fish host DNA contigs. relevant posterior probabilities = 1.TheMLanalyseswerecon- Sequences of S. aurata available in the NCBI dbEST (www.ncbi. ducted for each dataset under the PROTGAMMAGTR as im- nlm.nih.gov/dbEST/) were also included. The BLASTN parameters plemented in RAxML 8.1.3 (54). Bootstrap support was computed that were used are: -e-value 1e-75 and -perc_identity 85 (62). All after 250 rapid bootstrap replicates. Alignments and Bayesian tree sequences that passed this threshold were considered to be con- have been deposited in the TreeBASE repository (55) (purl.org/ taminants. Furthermore, we performed additional BLASTN phylo/treebase/phylows/study/TB2:S17743?format=html). searches against the NCBI nonredundant nucleotide database (e.g., to remove other contaminant such as bacterial sequences). Genome Size Estimation. Independent estimate of K. iwatai genome To filter the K. iwatai RNA assembly from contaminants (ei- size based on assembly coverage using reads and contig sequences ther host RNA or other sample contaminants), we ran RSEM from two independent sequencing runs. We ran CAP3 (69) with (67) with the default parameters and filtered the low abundance the parameters -o 300 -p 90 (300 bp overlap between contigs and transcripts using the filter_fasta_by_rsem_values.pl script sup- 90% identity in the overlapping sequence) on the GIIx K. iwatai plied by Trinity (68) with default parameters (–fpkm_cutoff DNA assembly to remove redundant contigs. A total of 952 re- = 1200–isopct_cutoff = 1.00). We then ran BLASTN with dundant contigs were removed using CAP3. To filter the K. iwatai -e-value 1e-75 and -perc_identity 85 against the sequences of the HiSeq DNA reads from contaminants, we created a contaminant contaminant database described above (mainly, S. aurata DNA database consisting of the Kudoa sequences identified as con- contigs and ESTs available in the NCBI database). We also added taminants, the S. aurata DNA contigs obtained, and all S. aurata the sequence of the S. aurata mitochondrial genome (65). All ESTs available in the NCBI EST database on May 2014. Bowtie2 sequences identified were removed. We then performed BLASTN was used with default settings to align the K. iwatai HiSeq DNA against the NCBI nucleotide database on the remaining sequences reads to the contaminant database. The –unconc flag was used to with -e-value 1e-75 and -perc_identity 80. We removed all of the save the reads that did not map to the contaminant database to contigs that matched any euteleost sequence with over 80% separate paired-end FASTQ files. A total of 164,284,209 reads identity and other taxa (e.g., fungi, Drosophila) with over 90% remained after this step. We then used Bowtie2 (70) to align the identity. Finally, we then performed a BLASTN search against the filtered paired-end HiSeq reads to the GIIx assembly with default two filtered K. iwatai DNA assemblies (HiSeq and GIIx assem- parameters. The coverage was calculated using bedtools genomecov blies) with an –e-value 1e-5 threshold and removed sequences that -d –ibam on the output of Bowtie2 (70). The average coverage-per- could not align to any of the DNA assemblies. P. hydriforme ge- position of the GIIx assembly was estimated (1391.6; SD: 1095.4; nomic and transcriptomic sequences were filtered using sequence SE: 0.2577). The genome size was then estimated by dividing the material from the paddlefish (Polyodon spathula)oocytetran- number of base pairs sequenced (filtered reads) by the coverage scriptome as described (28). according to the following formula: (number of paired reads) × (read length) × 2)/(average coverage-per-position) = (163,897,028 Phylogenetic Reconstruction. Phylogenetic reconstructions based × 100 × 2)/1,391.6 = 23,555,192. Using this method, the genome on the Bayesian and maximum-likelihood criteria were per- size was thus estimated to be ∼23.5 Mbp. formed for different gene and species combinations: (i) a dataset that includes 77 species representative of the diversity Genome Annotation. Genome annotation was conducted using with their closest outgroups and 200 ribosomal and non- MAKER2 (71), incorporating the Semi-HMM based Nucleic ribosomal protein genes (51,940 amino acids); (ii) a dataset that Acid Parser (SNAP) gene predictor software to assess gene includes 77 species representative of the animal diversity with content of the K. iwatai and P. hydriforme genome assemblies. their closest outgroups and 128 nonribosomal protein genes For each assembly, MAKER2 was first run using the assembled (41,237 amino acids); (iii) a dataset that includes 30 cnidarian transcriptome for each species (EST evidence), a file of the core

Chang et al. www.pnas.org/cgi/content/short/1511468112 1of6 CEGMA proteins, and a random precompiled eukaryotic HMM downloaded from NCBI. Each protein FASTA file was uploaded to profile to train SNAP. The output of this training was a species- the OrthoMCL Groups web server (www.orthomcl.org/orthomcl/ specific HMM profile for each assembly created by SNAP. In the proteomeUpload.do). OrthoMCL analysis results in lists of next round, MAKER2 was used to annotate the K. iwatai and P. OrthoGroup assignments for each protein in an assembly. These hydriforme genomes by using EST evidence, protein evidence for were parsed using R to create lists of unique ortholog group IDs each species, and the species-specific HMM files generated in (OGs) found in each assembly. These lists were used as input for the training round. The number of genes found by MAKER2 the jvenn web-server (bioinfo.genotoul.fr/jvenn/example.html)(73), and other annotation statistics were tabulated using the gene- which calculated the overlaps between all combinations of the stats function of the SNAP package (Table 2). Mean intron and exon sizes for H. magnipapillata and N. vectensis were calculated lists of OGs and created a four-way Venn Diagram to visualize from the annotated scaffolds from the Joint Genome Institute. these overlaps (Fig. S3). An independent estimate of intron size for K. iwatai was per- formed by mapping the RNA contigs onto the DNA contigs Analysis of Gene Pathways and Candidate Genes. For each candidate (a total of 23,393 introns were evaluated). This method revealed gene, a sequence from H. magnipapillata (Dataset S2) was used as a similar mean intron length estimate (i.e., 85.4 bp). a query for performing a tblastx search (e-value cutoff: 1e-03) against the genome and transcriptome assemblies. To confirm Gene Enrichment Analysis of Predicted Genes from Annotated Genomes. their cnidarian identity, assembly sequences with significant hits The gene contigs longer than 300 bp, predicted by MAKER2, for the were then BLAST-searched against the NCBI NR sequence K. iwatai and P. hydriforme assembly were annotated using the database using the blastx algorithm. We also assessed the com- Trinotate pipeline. The GO terms provided by the Trinotate an- pleteness of candidate signaling pathways in our assemblies using notation (68) were then analyzed and compared with those assigned KEGG. Genomic and transcriptomic materials were combined to the transcriptome of K. iwatai and P. hydriforme and to the into one file per species and sent through the KEGG Automatic H. magnipapillata and N. vectensis protein sequences, as described Annotation Server (KAAS) for ortholog assignment and pathway for the gene enrichment analysis of transcriptomes. mapping (www.genome.jp/tools/kaas/). The KAAS assigned KEGG OrthoMCL Analysis. The OrthoMCL database (42) was used to de- orthology (KO) terms for each species dataset using the single- termine the number of orthologous groups identified in the tran- directional best-hit method against a representative eukaryotic scriptome assemblies of K. iwatai and P. hydriforme, compared with dataset. After assignment of KO terms, completeness of the candi- published predicted proteins in H. magnipapillata and N. vectensis. date pathways compared with their canonical pathway was assessed ORFs were predicted and sequences translated using the OrfPre- and visualized in each species using the KEGG Mapper tool (www. dictor server (72). Predicted proteins for H. magnipapillata were genome.jp/kegg/tool/map_pathway1.html).

Chang et al. www.pnas.org/cgi/content/short/1511468112 2of6 1/73 Nanomia bijuga 0.2 1/99 Hydractinia polyclina Clytia hemisphaerica Hydra magnipapillata larynx Craspedacusta sowerbyi Tripedalia cystophora Cubozoa Cnidaria Cyanea capillata Stomolophus meleagris Scyphozoa Cnidaria Aurelia aurita Polypodium hydriforme Cnidaria Tetracapsuloides bryosalmonae Buddenbrockia plumatellae Kudoa iwatai Myxozoa Enteromyxum leei Sphaeromyxa zaharoni Thelohanellus kitauei Myxobolus cerebralis Gorgonia ventalina Eunicella cavolinii Octocorallia Cnidaria Pocillopora damicornis Montastraea faveolata Porites australiensis Acropora cervicornis Nematostella vectensis 1/93 Edwardsiella lineata Hexacorallia Cnidaria eques tuediae Hormathia digitata Aiptasia pallida Saccoglossus kowalevskii Strongylocentrotus purpuratus Patiria miniata Latimeria menadoensis Homo sapiens Deuterostomia Petromyzon marinus 1/68 Ciona intestinalis 1/94 Botryllus schlosseri Branchiostoma floridae 1/- Pediculus humanus 1/- Nasonia vitripennis Daphnia pulex Ecdysozoa Strigamia maritima Ixodes scapularis Helobdella robusta Capitella teleta Octopus vulgaris Lophotrochozoa Ilyanassa obsoleta Crassostrea gigas Trichoplax adhaerens Placozoa Sycon coactum Sycon ciliatum Leucetta chagosensis Corticium candelabrum Oscarella carmela Oscarella sp Euplectella aspergillum Porifera Aphrocallistes vastus Sarcotragus fasciculatus Chondrilla nucula Ephydatia muelleri Petrosia ficiformis Amphimedon queenslandica Euplokamis dunlapae Vallicula multiformis Pleurobrachia pileus Ctenophora Mnemiopsis leidyi Beroe abyssicola Monosiga ovata Acanthoeca sp Stephanoeca diplocostata Choanoflagellata 1/- Salpingoeca sp Monosiga brevicollis Ministeria vibrans Capsaspora owczarzaki Sphaeroforma arctica Amoebidium parasiticum Ichthyosporea

Fig. S1. Phylogenetic tree generated from a matrix of 41,237 amino acid positions, which excludes ribosomal genes, and 77 taxa using Bayesian inference under the CAT model. Support values are indicated only for nodes that did not receive maximal support. Bayesian posterior probabilities/ML bootstrap supports under the PROTGAMMAGTR are given near the corresponding node. A minus sign (‘‘–’’) indicates that the corresponding node is absent from the ML bootstrap consensus tree.

Chang et al. www.pnas.org/cgi/content/short/1511468112 3of6 A Nanomia bijuga Hydractinia polyclina 1/1/56 Clytia hemisphaerica Hydra magnipapillata Hydrozoa Ectopleura larynx 1/1/98 Craspedacusta sowerbyi Tripedalia cystophora Cubozoa Cyanea capillata Stomolophus meleagris Scyphozoa Aurelia aurita Polypodium hydriforme Tetracapsuloides bryosalmonae Buddenbrockia plumatellae Myxozoa Kudoa iwatai Enteromyxum leei Sphaeromyxa zaharoni Thelohanellus kitauei Myxobolus cerebralis Gorgonia ventalina Octocorallia 1/1/- Eunicella cavolinii Acropora cervicornis Porites australiensis Pocillopora damicornis Montastraea faveolata Nematostella vectensis Edwardsiella lineata Hexacorallia Hormathia digitata Aiptasia palida 0.2

B Nanomia bijuga 1/44 Hydractinia polyclina Clytia hemisphaerica Hydra magnipapillata Hydrozoa Ectopleura larynx Craspedacusta sowerbyi Tripedalia cystophora Cubozoa Cyanea capillata Stomolophus meleagris Scyphozoa Aurelia aurita Polypodium hydriforme Tetracapsuloides bryosalmonae Buddenbrockia plumatellae Myxozoa Kudoa iwatai 0.99/100 Enteromyxum leei Sphaeromyxa zaharoni Thelohanellus kitauei Myxobolus cerebralis Gorgonia ventalina Eunicella cavolinii Octocorallia Acropora cervicornis Porites australiensis Pocillopora damicornis Montastraea faveolata Nematostella vectensis Edwardsiella lineata Hexacorallia Urticina eques Bolocera tuediae Hormathia digitata Aiptasia palida 0.2

Fig. S2. (A) Phylogenetic tree generated from a matrix of 51,940 amino acid sequences and 30 cnidarian taxa using Bayesian inference under the CAT model. Support values are indicated only for nodes that did not receive maximal support. Bayesian posterior probabilities under the CAT model/Bayesian posterior probabilities under the CAT-GTR model/ML bootstrap supports under the PROTGAMMAGTR are given near the corresponding node. A minus sign (‘‘–’’)in- dicates that the corresponding node is absent from the ML bootstrap consensus tree. (B) Phylogenetic tree generated from a matrix that excludes ribosomal genes, comprising 41,237 amino acid sequences and 30 cnidarian taxa using Bayesian inference under the CAT model. Support values are indicated only for nodes that did not receive maximal support. Bayesian posterior probabilities/ML bootstrap supports under the PROTGAMMAGTR are given near the corre- sponding node.

Chang et al. www.pnas.org/cgi/content/short/1511468112 4of6 Myxobolus Kudoa Polypodium

8000

7000

6000

5000

4000

3000 Number of contigs

2000

1000

0

Contig lengths (bp)

Fig. S3. Sequence size distribution of the assembled transcriptome sequences.

Response to stress Morphogenesis Signal transduction Hydrolase activity Cell differentiation Cytoplasm Organelle organization and biogenesis Transport Protein metabolism Cell communication Transferase activity Binding Biosynthesis Nucleobase, nucleoside, nucleotide and nucleic acid metabolism Development Cell organization and biogenesis Intracellular Cell Catalytic activity Metabolism 0% 1% 2% 3% 4% 5% 6% 7% 8% 9% 10% % Mapping to GO term

Kudoa (genome) Kudoa (transcriptome) Polypodium (genome) Polypodium (transcriptome) Hydra Nematostella

Fig. S4. GO annotation of unigenes in genomes and transcriptomes. The top 20 GO categories are shown as a percentage of total GO terms from the assemblies of K. iwatai, P. hydriforme, and the published protein sequences of H. magnipapillata and N. vectensis. Categories for which K. iwatai presents significantly fewer GO terms than other cnidarians are indicated in boldface type.

Chang et al. www.pnas.org/cgi/content/short/1511468112 5of6 Nematostella Kudoa

4063 174 Hydra Polypodium 44

2081 24

142 84 1148 355

2178

31 369

58 2201

182

Size of each list

11162 11162

5581 8021 2735 5451

# of Ortho Groups 0 Hydra Nematostella Kudoa Polypodium

Species

Fig. S5. Comparison of OGs in myxozoan and other cnidarian transcriptomes. VENN diagram comparing OGs for the OrthoMCL database from transcriptome assemblies of K. iwatai and P. hydriforme and published predicted proteins in H. magnipapillata and N. vectensis.

Table S1. Sample information and accession numbers Reads Genome/transcriptome Species DNA/RNA Platform (bp PE) BioProject BioSample SRA experiments shotgun assembly Locality

K. iwatai DNA G2x 76 PRJNA261422 SAMN03068681 SRX704259 JRUY00000000 Red Sea, Eilat K. iwatai DNA HiSEq. 2000 100 PRJNA261052 SAMN03068681 SRX702459 JRUX00000000 Red Sea, Eilat K. iwatai RNA HiSEq. 2000 100 PRJNA248713 SAMN02800925 SRX554567 GBGI00000000 Red Sea, Eilat E. leei DNA HiSEq. 2000 100 PRJNA284325 SAMN03701405 SRX1034928 LDNA00000000 Red Sea, Eilat S. zaharoni DNA HiSEq. 2000 100 PRJNA284326 SAMN03701400 SRX1034914 LDMZ00000000 Red Sea, Eilat M. cerebralis RNA HiSEq. 2000 100 PRJNA258474 SAMN02998096 GBKL00000000 California P. hydriforme RNA HiSEq. 2000 100 PRJNA251648 SAMN02837860 SRX570527 GBGH00000000 Grand Lake State Park, OK P. hydriforme DNA HiSEq. 2000 100 PRJNA259515 SAMN02998112 SRX687102 Grand Lake State Park, OK

Other Supporting Information Files

Dataset S1 (XLSX) Dataset S2 (XLSX) Dataset S3 (PDF) Dataset S4 (XLSX)

Chang et al. www.pnas.org/cgi/content/short/1511468112 6of6