
The genome of Aiptasia, a sea anemone model for coral symbiosis

Sebastian Baumgarten^a, Oleg Simakov^b,1, Lisl Y. Esherick^c, Yi Jin Liew^a, Erik M. Lehnert^c, Craig T. Michell^a, Yong Li^a, Elizabeth A. Hambleton^b, Annika Guse^b, Matt E. Oates^d, Julian Gough^d, Virginia M. Weis^e, Manuel Aranda^a, John R. Pringle^c,2, and Christian R. Voolstra^a,2

^aRed Sea Research Center, Division of Biological and Environmental Science and Engineering, King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia; ^bCentre for Organismal Studies, Heidelberg University, 69120 Heidelberg, Germany; ^cDepartment of Genetics, Stanford University School of Medicine, Stanford, CA 94305; ^dDepartment of Computer Science, University of Bristol, Bristol BS8 1UB, United Kingdom; and ^eDepartment of Integrative Biology, Oregon State University, Corvallis, OR 97331

Edited by Nancy Knowlton, Smithsonian Institution, Washington, DC, and approved August 4, 2015 (received for review July 8, 2015)

The most diverse marine ecosystems, coral reefs, depend upon a functional symbiosis between a cnidarian animal host (the coral) and intracellular photosynthetic dinoflagellate algae. The molecular and cellular mechanisms underlying this endosymbiosis are not well understood, in part because of the difficulties of experimental work with corals. The small sea anemone Aiptasia provides a tractable laboratory model for investigating these mechanisms. Here we report on the assembly and analysis of the Aiptasia genome, which will provide a foundation for future studies and has revealed several features that may be key to understanding the evolution and function of the endosymbiosis. These features include genomic rearrangements and taxonomically restricted genes that may be functionally related to the symbiosis, aspects of host dependence on alga-derived nutrients, a novel and expanded cnidarian-specific family of putative pattern-recognition receptors that might be involved in the animal–algal interactions, and extensive lineage-specific horizontal gene transfer. Extensive integration of genes of prokaryotic origin, including genes for antimicrobial peptides, presumably reflects an intimate association of the animal–algal pair also with its prokaryotic microbiome.

coral reefs | endosymbiosis | horizontal gene transfer | dinoflagellate | pattern-recognition receptors

Significance

Coral reefs form marine-biodiversity hotspots of enormous ecological, economic, and aesthetic importance that rely energetically on a functional symbiosis between the coral animal and a photosynthetic alga. The ongoing decline of corals worldwide due to anthropogenic influences, including global warming, ocean acidification, and pollution, heightens the need for an experimentally tractable model system to elucidate the molecular and cellular biology underlying the symbiosis and its susceptibility or resilience to stress. The small sea anemone Aiptasia is such a system, and our analysis of its genome provides a foundation for research in this area and has revealed numerous features of interest in relation to the evolution and function of the symbiotic relationship.

Author contributions: S.B., E.M.L., J.R.P., and C.R.V. designed research; S.B., O.S., Y.J.L., E.M.L., C.T.M., Y.L., E.A.H., A.G., and M.A. performed research; M.E.O., J.G., V.M.W., and C.R.V. contributed new reagents/analytic tools; S.B., O.S., L.Y.E., Y.J.L., E.M.L., J.R.P., and C.R.V. analyzed data; and S.B., O.S., L.Y.E., Y.J.L., J.R.P., and C.R.V. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
Freely available online through the PNAS open access option.
Data deposition: The data reported in this paper are available at aiptasia.reefgenomics.org and have been deposited in the NCBI database (accession no. PRJNA261862).
^1Present address: Okinawa Institute of Science and Technology, Okinawa 904-0495, Japan.
^2To whom correspondence may be addressed. Email: [email protected] or [email protected].
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1513318112/-/DCSupplemental.
www.pnas.org/cgi/doi/10.1073/pnas.1513318112 PNAS | September 22, 2015 | vol. 112 | no. 38 | 11893–11898

Coral reefs form marine-biodiversity hotspots that are of enormous ecological, economic, and aesthetic importance. Coral growth and reef deposition are based energetically on the endosymbiosis between the cnidarian animal hosts and photosynthetic dinoflagellate algae of the genus Symbiodinium, which live in vesicles within the gastrodermal (gut) cells of the animal and typically supply ≥90% of its total energy, while the host provides the algae with a sheltered environment and the inorganic nutrients needed for photosynthesis and growth (1). This tight metabolic coupling allows the holobiont (i.e., the animal host and its microbial symbionts) to thrive in nutrient-poor waters. Although the ecology of coral reefs has been studied intensively, the molecular and cellular mechanisms underlying the critical endosymbiosis remain poorly understood (2). As coral reefs face an ongoing and increasing threat from anthropogenic environmental change (3), new insights into these mechanisms are of critical importance to understanding the resilience and adaptability of coral reefs and thus to the planning of conservation strategies (4).

Aiptasia is a globally distributed sea anemone that harbors endosymbiotic Symbiodinium like its Class Anthozoa relatives the stony corals (Fig. 1 and SI Appendix, Fig. S1A) (4, 5). Aiptasia has a range of polyp sizes convenient for experimentation and is easily grown in laboratory culture, where it reproduces both asexually (so that large clonal populations can be obtained) and sexually (allowing experiments on larvae and potentially genetic studies), and it can be maintained indefinitely in an aposymbiotic (dinoflagellate-free) state and reinfected by a variety of Symbiodinium strains (6, 7). These characteristics make Aiptasia a highly attractive model system for studies of the molecular and cellular basis of the cnidarian–dinoflagellate endosymbiosis (2, 4). To provide a solid platform for research on Aiptasia, we have sequenced and analyzed its genome. The results have already provided important insights into several aspects of the evolution and function of the symbiotic system.

Fig. 1. Phylogenetic position and different symbiotic states of Aiptasia. (A) Partial phylogenetic tree (see SI Appendix, SI Materials and Methods and Fig. S1A for details) shows Aiptasia grouped with other anthozoans among the cnidarians. Numbers on nodes denote bootstrap values. (B–D) An aposymbiotic Aiptasia polyp (B) and symbiotic polyps viewed under white light (C) or by fluorescence microscopy to visualize the red chlorophyll autofluorescence of the endosymbiotic Symbiodinium algae (D).

Results and Discussion

Genome Size and Assembly. We used flow cytometry to estimate the haploid genome size of Aiptasia, obtaining a value of ∼260 Mb (SI Appendix, Fig. S2A, 1–3). This value is smaller than those reported previously for three other cnidarians (SI Appendix, Table S1), including two other anthozoans, the anemone Nematostella vectensis (450 Mb) (8) and the stony coral Acropora digitifera (420 Mb) (9). However, we estimated a genome size of ∼329 Mb for N. vectensis by flow cytometry (SI Appendix, Fig. S2A, 4–6), and a previous independent estimate by densitometry of Feulgen-stained embryonic nuclei gave an even lower value of ∼220 Mb (10).

To generate a genome sequence, we used DNA from a clonal population of aposymbiotic anemones to obtain ∼140× coverage by Illumina short-read sequencing plus additional sequence from long-insert mate-pair libraries (SI Appendix, Table S2). From these sequences we assembled a draft genome of ∼258 Mb comprised of 5,065 scaffolds and 29,410 contigs, with half of the genome found in scaffolds longer than 440 kb and in contigs longer than 14.9 kb (Table 1 and SI Appendix, Table S3). Using ab initio prediction informed by a reference transcriptome covering several developmental and symbiotic states (SI Appendix, Tables S4 and S5 and Fig. S1C), we identified 29,269 gene models, of which 26,658 appear to be complete (Table 1) and 27,469 (including 25,370 of the seemingly complete genes) were supported by RNA evidence (SI Appendix, Table S6). Of the 29,269 gene models, 26,162 (89%) have identifiable homologs in other eukaryotes and 22,993 have annotations in the UniProt database (SI Appendix, Table S6 and Dataset S1.1). The Aiptasia genome assembly and gene predictions compare well to those in the other sequenced cnidarians (SI Appendix, Table S1), and this comparison also suggests that the smaller genome size of Aiptasia reflects largely the smaller sizes of introns and their lower frequency (as indicated by a larger mean exon size) in Aiptasia (SI Appendix, Table S1). Indeed, as expected, the diversity of molecular functions represented in the predicted Aiptasia protein set is comparable to those in other cnidarians and in other "basal" metazoans (SI Appendix, Fig. S2 B and C) (11).

Repeat Content and Evolution, Synteny, and Fast-Evolving Genes. The Aiptasia genome contains ∼26% repeated sequences, intermediate between what has been reported for other cnidarians (SI Appendix, Table S1) and similar to the estimates for other basal metazoans such as Capitella teleta (31%) and Lottia gigantea (20%) (11, 12). Of the repeated sequences, ∼36% can be assigned to previously known repetitive elements (SI Appendix, Table S7) that are in multiple categories including a variety of known transposable elements (TEs). We found at least four classes of TEs that showed an expansion at a Jukes–Cantor distance of 0.4–0.5 (Fig. 2A and SI Appendix, Fig. S3 A and B). This ancient TE expansion aligns with the time of divergence of Aiptasia from Verrilactis paguri and Sagartia elegans, two anemone species that are not symbiotic with dinoflagellates, based on Nei–Gojobori synonymous-substitution rates in the cytochrome c oxidase III gene (COX3) (Fig. 2A). Such burst-like expansions of TEs have been reported to reflect ancient population bottlenecks (13) and/or speciation events, as well as major increases in genome size, such as that found in Hydra (14). Remarkably, an ancient expansion of TEs in the stony coral A. digitifera resembles that in Aiptasia, whereas the more closely related N. vectensis (which is not symbiotic with dinoflagellates) appears not to have had such an expansion (SI Appendix, Fig. S3A). The existence of similar repeat expansions in Aiptasia and A. digitifera is intriguing, and it remains to be determined if these genomic rearrangements were functionally associated with some common event affecting their symbiotic lifestyles.

Despite the potential genome-rearranging effects of such TE activity (11), a total of 3,377 Aiptasia genes are found in synteny blocks of 3–33 genes (mean of 4.8) that are shared with one or more other metazoans (SI Appendix, SI Materials and Methods). This is similar to previously reported findings with other animals (11). About half of these synteny blocks, containing 1,727 genes, are shared only with one or more other cnidarians. Interestingly, although many of the Aiptasia HOX genes are in clusters, as in other metazoans, the detailed organization of the clusters is distinct even from that in other anthozoans, leaving the ancestral organization of the HOX cluster uncertain (15, 16) (Fig. 2B and Dataset S1.2).

To seek further insights into Aiptasia evolution and symbiosis, we identified the gene families with elevated rates of amino acid substitution relative to other cnidarians (SI Appendix, SI Materials and Methods). Among 2,478 gene families, 143 (containing 165 genes) were found to show accelerated evolution (Dataset S1.3), but analysis of the functions ascribed to these families did not reveal obvious clues to the molecular basis of the endosymbiotic relationship.

Taxonomically Restricted Genes. We found 3,107 genes whose predicted products had no discernible homologs in other organisms (SI Appendix, Table S6). Of these, 1,946 were supported by transcriptomic evidence. Such taxonomically restricted genes (TRGs) are thought to play important roles in the creation of evolutionary novelties and morphological diversity within the taxonomic group (17, 18). Although the TRG products lack similarity to known proteins from which putative functions could be derived, we found that many of the TRGs were expressed differentially in different states of symbiosis (Fig. 3A and SI Appendix, Fig. S4A) and/or between adults and larvae (SI Appendix, Fig. S4A), suggesting that at least some are real genes with specific functions, possibly in Aiptasia-specific aspects of endosymbiosis or development.

To search further for homologies of the TRGs, a Hidden-Markov-Model analysis was conducted (SI Appendix, SI Materials and Methods), identifying 52 putatively new protein domains. These domains were found in 122 of the putative TRGs, including 14 cases in which the two genes encoding the same seemingly new domain were immediately adjacent to each other or separated by a single gene, suggesting recent gene duplication. In addition, these putatively new domains were found in 145 Aiptasia proteins that were not classified as TRG products (Dataset S1.4). Also of interest was that the putative TRGs were not randomly distributed in the genome, with about one-third being present in clusters of two to seven with no interspersed genes (Fig. 3B). The genes in these clusters were not coexpressed (SI Appendix, Fig. S4B), and the biological reason for the nonrandom genomic arrangement remains unclear.

Table 1. Statistics for the Aiptasia genome assembly and gene annotation

Parameter                                                         Value
Genome size, by measurement of DNA content*                       260 Mb
Total size of genome assembly, in 5,065 scaffolds                 258 Mb
Total contig size, in 29,410 contigs                              213 Mb
Scaffold N50                                                      440 kb
Longest scaffold                                                  2,344 kb
Contig N50                                                        14.9 kb
Longest contig                                                    228 kb
Number of gene models                                             29,269
Number of apparently complete gene models†                        26,658
Number of predicted proteins with recognizable (E-value ≤1e-5)
  homologs in other eukaryotes‡                                   26,162
Mean predicted protein length                                     517 amino acids
Repeat content                                                    26%

See also SI Appendix, Tables S1, S3, and S6. kb, kilobase pairs; Mb, megabase pairs; N50, 50% of the assembly is in scaffolds or contigs longer than this value.
*See SI Appendix, Fig. S2A.
†Gene models with both predicted start and stop codons.
‡See SI Appendix, Table S6 for details.
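Table 1 defines N50 as the length such that 50% of the assembly lies in scaffolds or contigs longer than that value. A minimal Python sketch of that calculation follows. The function name `n50` and the toy lengths are illustrative assumptions and are not taken from the published pipeline; the sketch uses the common convention of returning the first length at which the running total, taken from longest to shortest, reaches half of the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold or contig lengths.

    Sequences are visited from longest to shortest; the N50 is the length of
    the sequence at which the running total first reaches half of the summed
    length of all sequences.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical toy assembly (lengths in kb), for illustration only.
toy_scaffolds = [100, 80, 60, 40, 20]
print(n50(toy_scaffolds))
```

Applied to the real scaffold and contig lengths, the same calculation would be expected to give values corresponding to the scaffold N50 and contig N50 rows of Table 1.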

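The nonrandom clustering of putative TRGs was evaluated by comparing observed cluster counts with a random expectation (Fig. 3B); the procedure actually used is described in SI Appendix, SI Materials and Methods and is not reproduced here. The Python sketch below shows one simple way such a comparison could be set up, under the assumptions (hypothetical, for illustration only) that genes are represented as an ordered list of TRG/non-TRG flags along a single scaffold and that the random expectation is approximated by shuffling those flags. The names `run_sizes` and `expected_run_sizes`, the number of shuffles, and the toy input are all illustrative.

```python
import random
from collections import Counter


def run_sizes(flags):
    """Count runs of two or more consecutive True flags, keyed by run length."""
    counts = Counter()
    run = 0
    for flag in list(flags) + [False]:  # trailing sentinel closes a final run
        if flag:
            run += 1
        else:
            if run >= 2:
                counts[run] += 1
            run = 0
    return counts


def expected_run_sizes(flags, n_shuffles=1000, seed=0):
    """Mean number of runs of each size after randomly shuffling the flags."""
    rng = random.Random(seed)
    shuffled = list(flags)
    totals = Counter()
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        totals.update(run_sizes(shuffled))
    return {size: total / n_shuffles for size, total in totals.items()}


# Hypothetical gene order on one scaffold: True marks a putative TRG.
toy_flags = [True, True, False, True, True, True, False, False, True, False]
print("observed:", dict(run_sizes(toy_flags)))
print("expected:", expected_run_sizes(toy_flags, n_shuffles=200))
```

A real analysis would need to treat each scaffold separately so that runs do not span scaffold boundaries, and it would report the spread across shuffles (for example as SDs) as well as the mean.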
[Fig. 2A plot: TE classes shown are DIRS, EnSpm, BEL, CR1, Gypsy, Penelope, and Others; ordinate, % of genome; abscissa, Jukes–Cantor distance.]

Fig. 2. Genomic rearrangements in Aiptasia. (A) Correlation between a period of speciation and a period of high TE activity. (Top) Approximate times of divergence of Aiptasia from the anemone species S. elegans, V. paguri, and Bartholomea annulata based on Nei–Gojobori synonymous substitution rates in the COX3 gene (SI Appendix, SI Materials and Methods), which correspond to the Jukes–Cantor distances used to estimate the times of past TE activity. Aiptasia and B. annulata are symbiotic with dinoflagellates; V. paguri and S. elegans are not. (Bottom) The dynamics of seven distinct TE classes, plotted as the cumulative percentage of the genome covered by each class (ordinate) at given Jukes–Cantor substitution distances [abscissa; calculated based on the nucleotide differences between the individual genomic TEs and the consensus sequence for the corresponding TE family (SI Appendix, SI Materials and Methods)]. (B) Arrangements of HOX gene clusters in three anthozoans. N. vectensis and A. digitifera genes are named as described previously (15); Aiptasia genes are named based on the sequence similarities as shown in SI Appendix, Fig. S3C. Arrows indicate directions of transcription. Blue, other genes; green, HOX and HOX-related genes.

Fig. 3. Differential expression and nonrandom chromosomal clustering of taxonomically restricted genes (TRGs) in Aiptasia. (A) Heat map of FPKM expression values (SI Appendix, SI Materials and Methods) for 63 putative TRGs with expression-level changes of ≥eightfold (up or down) between partially (Partial) or fully (Sym) infected anemones and anemones without dinoflagellates (Apo). (B) To evaluate genomic clustering, the observed numbers of clusters of two or more putative TRGs without intervening genes were compared with the expectations based on a random distribution of such genes in the genome (SI Appendix, SI Materials and Methods). Error bars indicate SDs; numbers in parentheses indicate the actual numbers of clusters of those sizes.

Metabolic Exchanges Between the Partners. The cnidarian–dinoflagellate symbiosis involves an intimate metabolic exchange between the partners, but many of the details remain obscure. In particular, the respective contributions to animal metabolism from ingested food, synthesis by Symbiodinium, synthesis by the prokaryotic microbiome, and synthesis by the animal cells themselves remain unclear in many cases. The Aiptasia genome sequence should help to elucidate these matters, as the following examples illustrate.

First, BLASTP (protein basic local alignment search tool) searches of the gene models confirm previous transcriptomic evidence (19) that anthozoans, like other animals, lack the ability to synthesize 12 of the 20 common amino acids from central-pathway intermediates (SI Appendix, Table S8). The missing enzymes include two that are essential for the synthesis of the sulfur-containing amino acids, but the genome also confirms that Aiptasia, like other animals including N. vectensis (SI Appendix, Table S8) and several corals (9), contains a gene for a cystathionine-β-synthase, so that it should be able to synthesize cysteine from methionine (19). Thus, the reported absence of such a gene in several Acropora species (9) remains a puzzle. Although labeling studies have indicated that various essential amino acids are transported from the dinoflagellate to the host (20), it remains unclear if this supply is adequate for the animal's needs or whether additional supplies from ingested food are also necessary.

Second, analysis of the genome sequence confirmed the presence of five genes (19) encoding Npc2-type sterol-transport proteins and identified a sixth such gene. Interestingly, five of these genes occur in two clusters that probably reflect their evolutionary origins by tandem duplication (SI Appendix, Fig. S5A and Dataset S1.5). The two genes encoding proteins with typical cholesterol-binding sites (NPC2A and the newly identified NPC2F) are closely linked and have three introns apiece, whereas three of the four genes encoding proteins that presumably cannot bind cholesterol (19) are present in a second cluster and have either zero or one intron. The latter genes may represent an anthozoan-specific expansion of the family (19), and both differential expression (19, 21) and protein localization (22) data suggest that the product of one of these genes (NPC2D) is involved in transport of dinoflagellate-synthesized sterols to the host cytosol in symbiotic anemones. Such transport may provide the host with the bulk sterols that it needs for membrane formation, and indeed analysis of the predicted protein sets of Aiptasia, N. vectensis, and A. digitifera indicates that anthozoans (like Drosophila, but unlike both mammals and Symbiodinium) are unable to synthesize sterols from glycolytic intermediates (SI Appendix, Fig. S5B and Dataset S1.6). However, most dinoflagellate-synthesized sterols are structurally distinct from the bulk of those found in the cnidarian hosts (23–26), and it remains unclear whether the dinoflagellate and host can transport and biochemically convert the dinoflagellate-produced sterols in sufficient quantity to fulfill the host's needs in the absence of a supply also from ingested food. As suggested previously (19), an intriguing alternative would be that the transported sterols serve a role in host recognition of the algal symbionts.

Interactions of the Host with Algal Symbionts and Other Microbes. A cnidarian host must distinguish among potential symbionts, pathogens, and particles of food. Moreover, a given host can establish symbiosis with some strains of Symbiodinium but not others (6, 7), and there is good evidence from Hydra that cnidarian hosts also actively shape the assembly of a specific and beneficial prokaryotic microbiome (27, 28). Although understanding of microbiome assembly in corals and anemones is much less complete, recent studies suggest that processes similar to those in Hydra are involved (29–32). Such discriminations must be accomplished in the absence of an adaptive immune system and presumably depend on innate immunity mechanisms that involve the recognition of microbial cell-surface molecules by host pattern-recognition receptors (PRRs) (2, 33). Indeed, there is evidence that the recognition of compatible Symbiodinium types depends on the specific binding of algal cell-surface glycans by host lectins (34). Thus, it was of interest that among the significantly enriched protein domains in the Aiptasia genome (SI Appendix, Fig. S6A) was the fibrinogen domain, which acts as a carbohydrate-binding moiety in secreted vertebrate PRRs that triggers the innate immunity cascade of the lectin-complement pathway (35). Searching the Pfam annotation (SI Appendix, SI Materials and Methods) revealed 116 Aiptasia proteins containing fibrinogen domains, of which 13 also contain an N-terminal collagen domain as in the bilaterian ficolins (35). Strikingly, not only has this protein family undergone apparently independent expansions in the symbiotic anemone and stony-coral lineages (Fig. 4A), but most family members (including all identified to date in Aiptasia) also contain two or three Ig domains lying between the collagen and fibrinogen domains (Fig. 4A). As proteins with this tripartite domain structure appear to be absent from bilaterians and have not previously been described, we have named them Cnidarian ficolin-like proteins (CniFLs), and we speculate that the Ig domains may contribute to their ability to recognize a variety of microbial surface
patterns with high specificity. Interestingly, a similar expansion of proteins containing both Ig-superfamily and fibrinogen domains (FREPs) has been identified in the snail Biomphalaria glabrata, in which the Ig domains are not only expanded in the genome but further diversified through point mutations and somatic recombination of the germ-line source sequences (36). Even though the Ig domain diversity in CniFLs and FREPs falls far short of that in the antibody system of vertebrates, our findings further support the hypothesis that the pattern-recognition capabilities of invertebrate innate immunity systems are more flexible than once thought.

Both analogy to the function of vertebrate PRRs and previous studies of complement components in cnidarians (37, 38) suggest that the Aiptasia CniFLs might function through the complement pathway, and indeed we were able to identify putative orthologs of most components of this pathway in the Aiptasia predicted protein set (Fig. 4B and Dataset S1.8). Accordingly, a CniFL-complement pathway might be involved in the recognition of compatible Symbiodinium types, in shaping other aspects of the host microbiome, or both. Consistent with the former possibility, we note (i) that the CniFLs have so far been found only in symbiotic cnidarians (Aiptasia and two corals, but not N. vectensis) and (ii) that the Aiptasia CniFLs were almost all up-regulated (in some cases dramatically so) in aposymbiotic relative to symbiotic anemones (SI Appendix, Table S9). This up-regulation might reflect a role in recognition and uptake of compatible Symbiodinium cells by animals that lack them, with down-regulation following the successful establishment of symbiosis.

Fig. 4. Cnidarian ficolin-like proteins (CniFLs), a newly recognized family of putative PRRs. (A) Maximum-likelihood phylogenetic tree of all CniFLs identified in Aiptasia and stony corals (A. digitifera and Stylophora pistillata) and of all canonical ficolins (Fcns) found in stony corals and the anemone N. vectensis (Dataset S1.7). A human and a mouse ficolin (one of three and two, respectively, in those species) were also included as an out-group. The tree is based on an alignment of a 113-amino-acid sequence (with gaps removed) that spans portions of the collagen and fibrinogen domains. Most of the identified CniFLs contain three central Ig domains, but five (indicated by an asterisk) contain only two. Numbers on nodes denote bootstrap values. (B) Components of the lectin-complement pathway that are encoded in the Aiptasia genome (based on KEGG analysis; SI Appendix, SI Materials and Methods) and may be involved in signaling downstream of the CniFLs. BF, complement factor B; C2, C3, and C3b, complement components 2 and 3 and the cleavage product of C3; C4BP, complement component 4 binding protein; CR, complement receptor; HF1, complement factor H; MASP, mannose-binding-lectin–associated serine protease. Complement component C4 was not unequivocally identified in the currently available sequence but is included here for the sake of completeness (Dataset S1.8).

Toll-like receptors (TLRs) and Interleukin-like receptors (ILRs) are additional PRR classes; both recognize extracellular microbial patterns and signal through the intracellular Toll/Interleukin-receptor (TIR) domain. Consistent with a previous transcriptome study (39), the current Aiptasia genome assembly has not revealed any canonical TLR with both TIR and extracellular leucine-rich-repeat (LRR) domains, but it does contain four ILR homologs with both a TIR domain and one to three extracellular Ig domains, as well as two proteins with a TIR domain only (SI Appendix, Table S10 and Fig. S6B), which may partner with an unknown extracellular PRR. Consistent with previous studies (39, 40), many other predicted components of the TLR/ILR signaling pathways were identified using the Kyoto Encyclopedia of Genes and Genomes (KEGG)-annotated protein sets of Aiptasia, N. vectensis, and A. digitifera (SI Appendix, Fig. S6C and Dataset S1.9), suggesting that these pathways are functional in anthozoans. The close similarity between the Aiptasia and N. vectensis TLR/ILR protein repertoires and the apparently lineage-specific expansion of these proteins in A. digitifera (SI Appendix, Table S10 and Fig. S6B), but not in Aiptasia, suggest that these pathways are not involved in the interaction between host and Symbiodinium. Instead, as suggested by studies in Hydra (41), these proteins might function in shaping the host-associated prokaryotic microbiome. The microbiome might be more complex in corals than in anemones, or coral and anemone hosts might rely on different sets of mechanisms to shape their microbiomes.

Microbes that evade the extracellular recognition systems and invade the animal cell can be recognized by intracellular "nucleotide-binding and oligomerization domain" (NOD)-like receptors (NLRs), which trigger defensive responses overlapping with those of the TLRs and ILRs (27) (SI Appendix, Fig. S6 C and D). NLRs are characterized by a central NACHT or NB-ARC domain and are greatly expanded and diversified in cnidarians (42, 43). A search of the Aiptasia Pfam annotation revealed 86 proteins containing NACHT domains and 22 proteins containing NB-ARC domains, numbers similar to those reported for N. vectensis but much smaller than those reported for both Hydra magnipapillata and A. digitifera. It is not clear what could explain these substantial lineage-specific differences. As reported previously for H. magnipapillata (42), KEGG-based analysis revealed the presence in Aiptasia and other anthozoans of homologs of many components of the vertebrate NLR-triggered pathways but, surprisingly, not of the centrally important "receptor-interacting protein kinases" RIPK1 and RIPK2 (SI Appendix, Fig. S6D and Dataset S1.10). This result suggests that cnidarians have a distinct mechanism for the processing of signals resulting from NLR recognition of microbial patterns.

Evidence for Extensive Horizontal Gene Transfer. The associations between cnidarians and their endosymbiotic dinoflagellates and prokaryotic microbiomes have evolved over millions of years, making it likely that horizontal gene transfer (HGT) has occurred (44). To explore this possibility, we searched the predicted Aiptasia protein set for cases in which the best alignment was to a nonmetazoan rather than a metazoan sequence (SI Appendix, SI Materials and Methods). This search identified 275 HGT candidates that are specific to
Aiptasia (SI Appendix, Fig. S7A, Left) and an additional 548 "cnidarian-specific" cases in which the Aiptasia protein has a hit in one or more other cnidarians, but not in other metazoans, so that the gene is likely to have been transferred from a nonmetazoan source to a basal cnidarian (SI Appendix, Fig. S7A, Right and Dataset S1.11). Although the HGT candidates have a variety of apparent sources, genes of putative bacterial origin predominate (SI Appendix, Fig. S7A). Consistent with origins in bacterial sequences (and/or elements arising by reverse transcription) and a subsequent gradual accumulation of introns, both the Aiptasia-specific and cnidarian-specific HGT candidates have, on average, fewer introns than the overall Aiptasia gene set (Fig. 5A). Similar patterns have been observed for HGT candidates in H. magnipapillata, N. vectensis, and the rotifer Adineta vaga (14, 45, 46). As the Aiptasia-specific HGT candidates also have, on average, fewer introns than the cnidarian-specific HGT candidates (Fig. 5A), the Aiptasia lineage has presumably continued to acquire new genes by horizontal transfer since its split from the other cnidarian lineages examined here (rather than these genes having been lost in the other lineages).

Among the HGT candidates were 29 (17 Aiptasia-specific and 12 cnidarian-specific) whose best alignment was to a Symbiodinium protein (SI Appendix, Fig. S7A), suggesting that they were transferred from Symbiodinium or another dinoflagellate(s) into cnidarian hosts (or possibly the reverse). [Because the databases used (SI Appendix, SI Materials and Methods) contained sequences from Symbiodinium but not other dinoflagellates, there is ambiguity about the dinoflagellate(s) involved.] To test this possibility further, we constructed maximum-likelihood phylogenetic trees for 4 of the 12 cnidarian-specific proteins and sequences representing the full phylogenetic diversity in the UniProt database (SI Appendix, SI Materials and Methods). In each case, the Aiptasia protein and its homologs in other cnidarians and Symbiodinium form a distinct clade that is itself embedded in a larger clade containing almost entirely proteins from nonmetazoans (SI Appendix, Fig. S7B), supporting the hypothesis of a dinoflagellate-to-cnidarian HGT (or vice versa).

One of the 12 cnidarian-specific genes has already been identified in previous studies as a probable case of HGT (9, 47). In N. vectensis, A. digitifera, and several dinoflagellates, a single gene was found to encode a 3-dehydroquinate synthase–like domain fused to an O-methyltransferase–like domain; these domains appear to catalyze consecutive steps in the synthesis of UV-protective mycosporine amino acids (48). Both domains from these dinoflagellate and cnidarian fusion proteins cluster in phylogenetic trees with each other and with the corresponding domains found in several cyanobacteria (9, 47) (SI Appendix, Fig. S7 C and D), although the domains are encoded in distinct but adjacent genes in those cyanobacteria (49). Indeed, our search of 89 cyanobacterial genomes (https://img.jgi.doe.gov/) for gene models containing a 3-dehydroquinate synthase domain (as annotated by Pfam) revealed 119 genes, none of which contained a fused O-methyltransferase domain. We also found that both domains of the predicted Aiptasia fusion protein cluster in phylogenetic trees as described previously for the N. vectensis and A. digitifera proteins and that Symbiodinium microadriaticum (Clade A1) also contains a gene encoding both domains with similar sequences (SI Appendix, Fig. S7 C and D). As the same fusion-protein gene is present in both dinoflagellates and cnidarians and the cnidarian sequences of both domains are more similar to those in dinoflagellates than to those in cyanobacteria, it seems likely that the gene fusion occurred just once and simultaneously with the transfer from a cyanobacterium into a basal dinoflagellate (49), with a subsequent transfer into one or more cnidarian ancestors from a dinoflagellate(s) that was potentially symbiotic (providing the intimate association that would facilitate the gene transfer). On this model, the presence of the fusion gene in N. vectensis can be explained most readily by the hypothesis that the modern species had an ancestor with a symbiotic dinoflagellate, a possibility that is supported by the fact that at least 9 of the 11 additional cnidarian-specific HGT candidates whose best alignment was to Symbiodinium are present in N. vectensis.

Among the 17 Aiptasia-specific HGT candidates whose best alignment was to a Symbiodinium protein, seven had domains that were annotated through the Pfam database. Three of these annotations were to the bacterial Tox-Art-HYD1 domain of ∼100 amino acids, which is known to act as an ADP ribosyltransferase toxin in several bacterial pathogens (50, 51) but has not previously been described in eukaryotic genomes (although one crustacean sequence is annotated in the Pfam database). In addition to the three genes found in the HGT screen, we found three more Aiptasia genes that contain Tox-Art-HYD1 domains based on Pfam annotation. Moreover, three of these six Tox-Art-HYD1 genes are present in one genomic region of ∼64 kb without intervening genes, and these three encode proteins that are particularly closely related in sequence (Fig. 5B and SI Appendix, Fig. S7E). Thus, it appears that Tox-Art-HYD1 domain-containing genes have undergone intraspecific duplications in Aiptasia after the initial HGT event. Key residues of the Tox-Art-HYD1 domain are conserved in Aiptasia and Symbiodinium (SI Appendix, Fig. S7E), and in a phylogenetic analysis, the Aiptasia and Symbiodinium domains formed a distinct clade embedded within a large clade of bacterial sequences (Fig. 5B), supporting the hypothesis of a Symbiodinium–Aiptasia HGT event following an initial transfer of a bacterial gene into one partner or the other. Bacterial toxins containing Tox-Art-HYD1 domains are thought to function in interspecific conflicts (51), and the Aiptasia and Symbiodinium proteins may be used similarly to combat pathogens and/or shape the composition of the host prokaryotic microbiome, much as species-specific antimicrobial peptides in Hydra appear to help shape its microbiome (52).

Fig. 5. Evidence for horizontal gene transfer (HGT) in Aiptasia. (A) Cumulative-distribution plot of intron numbers in Aiptasia-specific (green) and cnidarian-specific (blue) HGT candidates and in all 29,269 Aiptasia gene models (yellow). Because of the smaller sample sizes of the first two categories (275 and 823, respectively), 95% binomial-proportion confidence intervals (SI Appendix, SI Materials and Methods) are shown for them. (B) Maximum-likelihood phylogenetic tree for the Aiptasia Tox-Art-HYD1 domains apparently derived by intraspecific duplications after an HGT event. For simplicity, the numerous and varied bacterial species whose Tox-Art-HYD1 domains form the outer branches of the tree are not shown individually (SI Appendix, Fig. S7E). Numbers on nodes denote bootstrap values. An asterisk represents the three proteins in one genomic region.

Conclusions

Recent years have brought dramatically increased appreciation of the importance of understanding the molecular, cellular, and organismal bases of mutualistic host-microbe symbioses. Here

Baumgarten et al. PNAS | September 22, 2015 | vol. 112 | no. 38 | 11897 we have reported on the assembly and analysis of the genome insert paired-end sequencing libraries. The reference transcriptome was se- sequence and predicted protein sets of the sea anemone Aiptasia. quenced and assembled using RNA derived from different developmental Although differences are likely to exist between different sym- and symbiotic states of adult anemones from Aiptasia strain CC7 and larval biotic anthozoans, our analyses have revealed a variety of con- crosses of Aiptasia strains CC7 and H2. SI Appendix details the sequencing, served features that should help to illuminate the evolution of assembly, annotation, and genome analysis. the symbiotic lifestyle and provide the basis for the continued ACKNOWLEDGMENTS. We thank John Parkinson for help on the Symbio- development of Aiptasia as a critically needed model for coral- dinium phylogeny and Cory J. Krediet and Ved Chirayath for providing dinoflagellate endosymbiosis, which underlies one of the most pictures. Research reported in this publication was supported by the King important marine ecosystems, coral reefs. Abdullah University of Science and Technology and by Grant 2629 from the Gordon and Betty Moore Foundation. L.Y.E. and E.M.L. were sup- Materials and Methods ported by National Science Foundation Graduate Research Fellowships, a Stanford Graduate Fellowship, and National Institutes of Health Training For genome sequencing and assembly, DNA from anemones of the clonal Grant 5 T32 HG000044. E.A.H. and A.G. were supported by funding from Aiptasia strain CC7 was used for the preparation of Illumina short- and long- the Deutsche Forschungsgemeinschaft (GU 1128/3-1 to A.G.).

1. Dubinsky Z, Stambler N (2010) Coral Reefs: An Ecosystem in Transition (Springer Science & Business Media, Berlin).
2. Davy SK, Allemand D, Weis VM (2012) Cell biology of cnidarian-dinoflagellate symbiosis. Microbiol Mol Biol Rev 76(2):229–261.
3. Hoegh-Guldberg O, et al. (2007) Coral reefs under rapid climate change and ocean acidification. Science 318(5857):1737–1742.
4. Weis VM, Davy SK, Hoegh-Guldberg O, Rodriguez-Lanetty M, Pringle JR (2008) Cell biology in model systems as the key to understanding corals. Trends Ecol Evol 23(7):369–376.
5. Thornhill DJ, Xiang Y, Pettay DT, Zhong M, Santos SR (2013) Population genetic data of a model symbiotic cnidarian system reveal remarkable symbiotic specificity and vectored introductions across ocean basins. Mol Ecol 22(17):4499–4515.
6. Schoenberg DA, Trench RK (1980) Genetic variation in Symbiodinium (=Gymnodinium) microadriaticum Freudenthal, and specificity in its symbiosis with marine invertebrates. III. Specificity and infectivity of Symbiodinium microadriaticum. P Roy Soc B-Biol Sci 207(1169):445–460.
7. Hambleton EA, Guse A, Pringle JR (2014) Similar specificities of symbiont uptake by adults and larvae in an anemone model system for coral biology. J Exp Biol 217(Pt 9):1613–1619.
8. Putnam NH, et al. (2007) Sea anemone genome reveals ancestral eumetazoan gene repertoire and genomic organization. Science 317(5834):86–94.
9. Shinzato C, et al. (2011) Using the Acropora digitifera genome to understand coral responses to environmental change. Nature 476(7360):320–323.
10. Gregory TR (2015) Animal Genome Size Database. Available at genomesize.com/.
11. Simakov O, et al. (2013) Insights into bilaterian evolution from three spiralian genomes. Nature 493(7433):526–531.
12. Putnam NH, et al. (2008) The amphioxus genome and the evolution of the chordate karyotype. Nature 453(7198):1064–1071.
13. Lynch M, Walsh B (2007) The Origins of Genome Architecture (Sinauer Associates, Sunderland, MA).
14. Chapman JA, et al. (2010) The dynamic genome of Hydra. Nature 464(7288):592–596.
15. Chourrout D, et al. (2006) Minimal ProtoHox cluster inferred from bilaterian and cnidarian Hox complements. Nature 442(7103):684–687.
16. DuBuc TQ, Ryan JF, Shinzato C, Satoh N, Martindale MQ (2012) Coral comparative genomics reveal expanded Hox cluster in the cnidarian-bilaterian ancestor. Integr Comp Biol 52(6):835–841.
17. Tautz D, Domazet-Lošo T (2011) The evolutionary origin of orphan genes. Nat Rev Genet 12(10):692–702.
18. Khalturin K, Hemmrich G, Fraune S, Augustin R, Bosch TC (2009) More than just orphans: Are taxonomically-restricted genes important in evolution? Trends Genet 25(9):404–413.
19. Lehnert EM, et al. (2014) Extensive differences in gene expression between symbiotic and aposymbiotic cnidarians. G3 (Bethesda) 4(2):277–295.
20. Wang J, Douglas A (1999) Essential amino acid synthesis and nitrogen recycling in an alga–invertebrate symbiosis. Mar Biol 135(2):219–222.
21. Ganot P, et al. (2011) Adaptations to endosymbiosis in a cnidarian-dinoflagellate association: Differential gene expression and specific gene duplications. PLoS Genet 7(7):e1002187.
22. Dani V, Ganot P, Priouzeau F, Furla P, Sabourault C (2014) Are Niemann-Pick type C proteins key players in cnidarian-dinoflagellate endosymbioses? Mol Ecol 23(18):4527–4540.
23. Giner JL (1993) Biosynthesis of marine sterol side chains. Chem Rev 93(5):1735–1752.
24. Withers NW, Kokke WC, Fenical W, Djerassi C (1982) Sterol patterns of cultured zooxanthellae isolated from marine invertebrates: Synthesis of gorgosterol and 23-desmethylgorgosterol by aposymbiotic algae. Proc Natl Acad Sci USA 79(12):3764–3768.
25. Kokke WCMC, Fenical W, Bohlin L, Djerassi C (1981) Sterol synthesis by cultured zooxanthellae; implications concerning sterol metabolism in the host-symbiont association in caribbean gorgonians. Comp Biochem Physiol B 68(2):281–287.
26. Yamashiro H, Oku H, Higa H, Chinen I, Sakai K (1999) Composition of lipids, fatty acids and sterols in Okinawan corals. Comp Biochem Physiol B 122(4):397–407.
27. Bosch TC (2013) Cnidarian-microbe interactions and the origin of innate immunity in metazoans. Annu Rev Microbiol 67:499–518.
28. Fraune S, Bosch TC (2007) Long-term maintenance of species-specific bacterial microbiota in the basal metazoan Hydra. Proc Natl Acad Sci USA 104(32):13146–13151.
29. Gochfeld DJ, Aeby GS (2008) Antibacterial chemical defenses in Hawaiian corals provide possible protection from disease. Mar Ecol Prog Ser 362:119–128.
30. Bayer T, et al. (2013) The microbiome of the Red Sea coral Stylophora pistillata is dominated by tissue-associated Endozoicomonas bacteria. Appl Environ Microbiol 79(15):4759–4762.
31. Vidal-Dupiol J, et al. (2011) Innate immune responses of a scleractinian coral to vibriosis. J Biol Chem 286(25):22688–22698.
32. Fusetani N, Toyoda T, Asai N, Matsunaga S, Maruyama T (1996) Montiporic acids A and B, cytotoxic and antimicrobial polyacetylene carboxylic acids from eggs of the scleractinian coral Montipora digitata. J Nat Prod 59(8):796–797.
33. Kvennefors EC, Leggat W, Hoegh-Guldberg O, Degnan BM, Barnes AC (2008) An ancient and variable mannose-binding lectin from the coral Acropora millepora binds both pathogens and symbionts. Dev Comp Immunol 32(12):1582–1592.
34. Wood-Charlson EM, Hollingsworth LL, Krupp DA, Weis VM (2006) Lectin/glycan interactions play a role in recognition in a coral/dinoflagellate symbiosis. Cell Microbiol 8(12):1985–1993.
35. Fujita T (2002) Evolution of the lectin-complement pathway and its role in innate immunity. Nat Rev Immunol 2(5):346–353.
36. Zhang SM, Adema CM, Kepler TB, Loker ES (2004) Diversification of Ig superfamily genes in an invertebrate. Science 305(5681):251–254.
37. Kvennefors EC, et al. (2010) Analysis of evolutionarily conserved innate immune components in coral links immunity and symbiosis. Dev Comp Immunol 34(11):1219–1229.
38. Kimura A, Sakaguchi E, Nonaka M (2009) Multi-component complement system of Cnidaria: C3, Bf, and MASP genes expressed in the endodermal tissues of a sea anemone, Nematostella vectensis. Immunobiology 214(3):165–178.
39. Poole AZ, Weis VM (2014) TIR-domain-containing protein repertoire of nine anthozoan species reveals coral-specific expansions and uncharacterized proteins. Dev Comp Immunol 46(2):480–488.
40. Miller DJ, et al. (2007) The innate immune repertoire in cnidaria–Ancestral complexity and stochastic gene loss. Genome Biol 8(4):R59.
41. Franzenburg S, et al. (2012) MyD88-deficient Hydra reveal an ancient function of TLR signaling in sensing bacterial colonizers. Proc Natl Acad Sci USA 109(47):19374–19379.
42. Lange C, et al. (2011) Defining the origins of the NOD-like receptor system at the base of animal evolution. Mol Biol Evol 28(5):1687–1702.
43. Hamada M, et al. (2013) The complex NOD-like receptor repertoire of the coral Acropora digitifera includes novel domain combinations. Mol Biol Evol 30(1):167–176.
44. Dunning Hotopp JC (2011) Horizontal gene transfer between bacteria and animals. Trends Genet 27(4):157–163.
45. Gladyshev EA, Meselson M, Arkhipova IR (2008) Massive horizontal gene transfer in bdelloid rotifers. Science 320(5880):1210–1213.
46. Artamonova II, Mushegian AR (2013) Genome sequence analysis indicates that the model eukaryote Nematostella vectensis harbors bacterial consorts. Appl Environ Microbiol 79(22):6868–6873.
47. Starcevic A, et al. (2008) Enzymes of the shikimic acid pathway encoded in the genome of a basal metazoan, Nematostella vectensis, have microbial origins. Proc Natl Acad Sci USA 105(7):2533–2537.
48. Balskus EP, Walsh CT (2010) The genetic and molecular basis for sunscreen biosynthesis in cyanobacteria. Science 329(5999):1653–1656.
49. Waller RF, Slamovits CH, Keeling PJ (2006) Lateral gene transfer of a multigene region from cyanobacteria to dinoflagellates resulting in a novel plastid-targeted fusion protein. Mol Biol Evol 23(7):1437–1443.
50. Zhang D, de Souza RF, Anantharaman V, Iyer LM, Aravind L (2012) Polymorphic toxin systems: Comprehensive characterization of trafficking modes, processing, mechanisms of action, immunity and ecology using comparative genomics. Biol Direct 7:18.
51. Deng Q, Barbieri JT (2008) Molecular mechanisms of the cytotoxicity of ADP-ribosylating toxins. Annu Rev Microbiol 62:271–288.
52. Franzenburg S, et al. (2013) Distinct antimicrobial peptide expression determines species-specific bacterial associations. Proc Natl Acad Sci USA 110(39):E3730–E3738.

11898 | www.pnas.org/cgi/doi/10.1073/pnas.1513318112 Baumgarten et al. 1

Supplemental Information

Materials & Methods

Organisms. Anemones of the clonal Aiptasia strains CC7 (53) and H2 (54) were used in this study, with H2 animals used only in the crosses to produce larvae for transcriptome analyses (see below). Although CC7 and H2 have been described previously as A. pallida and A. pulchella, respectively, recent global genotyping has found two genetically distinct Aiptasia populations that do not conform to previous species descriptions (5). Thus, we refer to both strains simply as Aiptasia in this paper.

Anemones were grown as described previously (19). Briefly, animals were raised in a circulating artificial seawater (ASW) system at ~25°C with 20-40 µmol photons m⁻² s⁻¹ of photosynthetically active radiation on a 12 h:12 h light:dark cycle and fed freshly hatched Artemia (brine-shrimp) nauplii approximately twice per week. To generate aposymbiotic anemones (i.e., animals without dinoflagellate symbionts), anemones were placed in a polycarbonate tub and subjected to multiple repetitions of the following cycle: cold-shocking by addition of 4°C ASW and incubation at 4°C for 4 h, followed by 1-2 d at ~25°C in ASW containing the photosynthesis inhibitor diuron (Sigma-Aldrich #D2425) at 50 µM (lighting approximately as above). The putatively aposymbiotic anemones were then maintained for ≥1 month in ASW at ~25°C in the light (as above) with feeding (as above, with water changes on the days following feeding) to test for possible repopulation of the animals by any residual dinoflagellates. The anemones were then inspected individually by fluorescence stereomicroscopy to confirm the complete absence of dinoflagellates, whose bright chlorophyll autofluorescence is conspicuous when they are present (Fig. 1D). The anemones used as sources of genomic DNA were from tubs in which all of the animals were lacking dinoflagellates as assessed in this way.

To generate symbiotic anemones for transcriptome analysis (see below), adult aposymbiotic CC7 animals were infected with the compatible Clade B Symbiodinium strain SSB01, as originally isolated from Aiptasia strain H2 (54). To clarify the taxonomic position of SSB01, the marker Sym15 was amplified and sequenced as described previously (55). Using BLASTN, the SSB01 Sym15 was found to align best to Sym15 from S. minutum (Clade B) genotypes FLAp2 (5) and rt002 (56) (E-values, 5e-100 and 4e-96, respectively), and less well to Sym15 from the sister species S. psygmophilum (56) (E-value, 9e-63). The sequence of the cp23S marker gene from SSB01 (54) is also highly similar to that of S. minutum (56). Maximum parsimony trees of the Sym15 and cp23S markers from various strains confirm that SSB01 represents a genotype of the species S. minutum (Fig. S1B).

Estimations of genome size. To estimate the genome size of Aiptasia and compare it to that of the related (but not symbiotic with dinoflagellates) anemone Nematostella vectensis (Fig. 1A), nuclear DNA contents were measured using a fluorescence-activated cell sorter with chicken red-blood cells (CRBC: Innovative Research #50-176-866) as a well characterized reference of known genome size (2C DNA content of 2.33 pg or 2,279 Mb). Extraction and staining of nuclei were performed using the Partec CyStainPI absolute T kit (Partec #05-5023) following the manufacturer’s protocol. Briefly, cell lysates of seven aposymbiotic Aiptasia anemones (~0.5 cm long) and two N. vectensis anemones (~3 cm long) were prepared in 1 ml apiece of nuclei-isolation buffer using a plastic mortar and pestle. To isolate CRBC nuclei, 70 µl of cells were added to 1 ml of nuclei-isolation buffer without further mechanical grinding by mortar and pestle. The resulting cell lysates were incubated for 15 min at 22°C and filtered through a 40-µm mesh. After staining of both separate and mixed (Aiptasia plus CRBC; Aiptasia plus N. vectensis) cell lysates, their fluorescence signals were measured with a BD FACSCanto II cell analyzer (BD Bioscience).
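The conversion from fluorescence peaks to a genome size is not spelled out in the text. A minimal Python sketch of the usual ratio calculation follows, assuming that stain fluorescence is proportional to DNA content and that both sample and reference peaks come from diploid (2C) G0/G1 nuclei. The function name and the peak values in the usage example are hypothetical, not data from this study; only the CRBC reference value (2.33 pg, or 2,279 Mb) is taken from the text.

```python
# Approximate number of megabases per picogram of double-stranded DNA.
MB_PER_PG = 978.0


def haploid_genome_size_mb(sample_peak, reference_peak, reference_2c_mb=2279.0):
    """Estimate the haploid (1C) genome size in Mb from 2C fluorescence peaks.

    sample_peak and reference_peak are the modal fluorescence intensities of
    the sample and reference nuclei (hypothetical inputs for this sketch).
    reference_2c_mb defaults to the CRBC 2C value quoted in the text.
    """
    if sample_peak <= 0 or reference_peak <= 0:
        raise ValueError("peak intensities must be positive")
    # Assumption: fluorescence scales linearly with DNA content.
    sample_2c_mb = reference_2c_mb * sample_peak / reference_peak
    # Halve the 2C value to obtain the haploid genome size.
    return sample_2c_mb / 2.0


if __name__ == "__main__":
    # The two reference figures in the text agree with the pg-to-Mb factor.
    print(round(2.33 * MB_PER_PG))
    # Hypothetical peak intensities, chosen only to illustrate the call.
    print(haploid_genome_size_mb(sample_peak=520.0, reference_peak=2279.0))
```

Running mixed lysates, as described above, places the sample and reference peaks in the same measurement, so the ratio does not depend on run-to-run instrument drift.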

Isolation of genomic DNA, library preparation, and sequencing. To isolate high-molecular-weight genomic DNA (gDNA), aposymbiotic animals (see above) were processed using a Qiagen Genomic-tip 100/G kit (Qiagen #10243) following the instructions for extraction of gDNA from tissues. Briefly, ~10 medium-sized animals (~50-60 mg total wet weight) were homogenized in 9.5 ml of buffer G2 containing 400 µg ml⁻¹ RNase A for ~10 s in a PowerGen 125 rotor stator (Fisher Scientific) at 30,000 rpm. Proteinase K (Qiagen #19133) was added to a final concentration of 1 mg/ml, and the homogenate was incubated at 50°C for 2 h, centrifuged at 5,000 x g for 10 min at 4°C to remove particulates, and loaded onto an equilibrated Qiagen Genomic-tip 100/G. The sample was then washed and eluted following Qiagen's instructions. The quality of the isolated DNA was assessed by agarose-gel electrophoresis, and the DNA was quantified using a Qubit 2.0 Fluorometer (Life Technologies).

Two libraries designed to yield overlapping paired-end sequences were prepared using the NEBNext Ultra DNA Library Prep Kit (NEB #E7370) with insert sizes of ~180 and ~550 bp. These libraries were sequenced on one lane of an Illumina HiSeq2000 sequencer (180-bp library) and one lane of an Illumina MiSeq sequencer (550-bp library) with read lengths of 2 x 101 bp and 2 x 300 bp, respectively. In addition, two mate-pair libraries with insert sizes of 10 and 12 kb were prepared using the Nextera Mate Pair Sample Preparation Kit (gel-plus protocol) (Illumina #FC-132-1001) and sequenced on one lane of an Illumina MiSeq sequencer at a read length of 2 x 101 bp. Finally, two mate-pair libraries with insert sizes of 2 and 5 kb were prepared at the Beijing Genome Institute using DNA that we supplied and sequenced on one lane of an Illumina GAIIx sequencer with a read length of 2 x 90 bp. The libraries used and sequences obtained are summarized in Table S2. All raw sequences have been deposited in the NCBI Sequence-Read Archive (SRA) as FASTQ files under the accession number SRS742511.

Additional high-quality genomic sequence was obtained but not used in the final assembly because it was found not to improve the assembly. These sequences have also been deposited in SRA as FASTQ and SFF raw-sequence files. First, one paired-end library of ~250-bp insert size was prepared with the TruSeq DNA Sample Prep Kit (Illumina #FC-121-2003) and sequenced both in one lane on an Illumina GAIIx sequencer at 2 x 101-bp read length (accession number SRR576330) and in one lane on an Illumina HiSeq2000 at 2 x 101-bp read length (accession number SRR576326). Second, two libraries of ~250-bp insert size were prepared with the Nextera DNA Sample Prep Kit (Illumina #FC-121-1031); each library was then sequenced in two lanes on the HiSeq2000 at 2 x 101-bp read length (accession numbers SRR606428, SRR646473, and SRR646474; SRR646474 contains the reads from two lanes). Third, three reduced-representation libraries (57) were produced by digesting with a restriction enzyme (DraI, BamHI, or PvuII), selecting for fragments of 1-5 kb, preparing libraries (with separate barcodes) with the Nextera Kit, pooling, and sequencing in a single lane on the HiSeq2000 at 2 x 101-bp read length (accession number SRR609200). Finally, four libraries were sequenced on a 454FLX sequencer (accession number SRX757524).

Genome assembly and removal of contaminating sequences. Sequences from the libraries used for assembly were preprocessed with Trimmomatic (58) to trim sequencing adaptors and filter out low-quality reads. The genome assembly was then performed using ALLPATHS-LG (59) with default settings plus the HAPLOIDIFY=True setting. This de novo assembly incorporated error correction of reads, contig assembly from the paired-end reads, and final scaffolding using the mate-pair information. Gaps in the resulting genome assembly were closed using 10 iterations of GapFiller (60). A super-scaffolding step was then performed on the gap-filled assembly using SSPACE (61), followed by 10 additional iterations of GapFiller to close gaps introduced during the super-scaffolding step. These assembly-refinement steps used the error-corrected paired-end and mate-pair reads that were also used in the initial assembly.

To identify and remove scaffolds that were likely to have originated from dinoflagellate, bacterial, or viral contaminants, we used a custom script to conduct BLASTN searches against six databases: the genomes of S. minutum (62) and S. microadriaticum (http://reefgenomics.org/blast); the NCBI complete-bacterial-genomes (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.fna.tar.gz), draft-bacterial-genomes (ftp://ftp.ncbi.nih.gov/genomes/Bacteria_DRAFT/), and complete-viral-genomes (ftp://ftp.ncbi.nih.gov/genomes/Viruses/all.fna.tar.gz) databases; and the viral database PhAnToMe (phantome.org). As the lengths of the query and hit sequences ranged up to hundreds of kilobases, a combination of cutoffs (total bit score >1000, E-value ≤1e-20) was used to identify contigs with significant similarities to sequences in the databases. Five scaffolds that displayed significant similarity over ≥50% of their non-N sequences were considered to have originated from bacterial contaminants (Table S11) and were thus removed from the final assembly.
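The custom screening script is not reproduced here. The following Python sketch shows one way the stated cutoffs could be combined, under assumptions of our own: hits are supplied as dictionaries with hypothetical keys (`start`, `end`, `bitscore`, `evalue`) using 0-based, half-open query coordinates; the E-value cutoff is applied per hit; the bit-score cutoff is applied to the sum over retained hits; and coverage is computed from the merged hit intervals relative to the non-N length of the scaffold.

```python
def merged_length(intervals):
    """Total length covered by a set of half-open (start, end) intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged block and open a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent hit: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def is_likely_contaminant(scaffold_seq, hits, min_total_bits=1000.0,
                          max_evalue=1e-20, min_coverage=0.5):
    """Apply the cutoffs quoted in the text to one scaffold (sketch only)."""
    kept = [h for h in hits if h["evalue"] <= max_evalue]
    if sum(h["bitscore"] for h in kept) <= min_total_bits:
        return False
    non_n = sum(1 for base in scaffold_seq if base not in "Nn")
    if non_n == 0:
        return False
    # Simplification: hit intervals are assumed not to span runs of N's.
    covered = merged_length([(h["start"], h["end"]) for h in kept])
    return covered / non_n >= min_coverage


if __name__ == "__main__":
    scaffold = "A" * 100 + "N" * 100
    hits = [
        {"start": 0, "end": 40, "bitscore": 600.0, "evalue": 1e-50},
        {"start": 30, "end": 60, "bitscore": 600.0, "evalue": 1e-30},
    ]
    print(is_likely_contaminant(scaffold, hits))
```

Merging overlapping intervals before computing coverage avoids counting the same scaffold bases twice when several database sequences hit the same region.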

The basic statistics for contigs and scaffolds were assessed as in (63) using the Perl script provided by the Korf laboratory (http://korflab.ucdavis.edu/datasets/Assemblathon/Assemblathon2/Basic_metrics/assemblathon_stats.pl). Briefly, assembled scaffolds were broken into contigs wherever they contained a stretch of >25 N's, and statistics were calculated separately for scaffolds and contigs.
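The scaffold-to-contig splitting rule and a typical summary statistic can be sketched in a few lines of Python. This is an illustration of the rule as worded above (runs of more than 25 N's), not a reimplementation of the Korf laboratory script; the function names are ours, and N50 is used here only as a representative statistic.

```python
import re


def split_into_contigs(scaffold, max_n_run=25):
    """Break a scaffold at every run of more than max_n_run N's."""
    gap = re.compile(r"[Nn]{%d,}" % (max_n_run + 1))
    # Drop empty pieces that arise from leading or trailing gaps.
    return [piece for piece in gap.split(scaffold) if piece]


def n50(lengths):
    """Length L such that sequences of length >= L hold half the total bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2.0
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


if __name__ == "__main__":
    # Toy scaffold: the 26-N run splits it, the 10-N run does not.
    scaffold = "A" * 50 + "N" * 26 + "C" * 30 + "N" * 10 + "G" * 20
    contigs = split_into_contigs(scaffold)
    print([len(c) for c in contigs])
    print(n50([len(c) for c in contigs]))
```

Short runs of N's stay inside a contig under this rule, so contig lengths include those small gaps.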

Identification of repeats and analyses of transposable-element activity. RepeatScout (64) was used to conduct de novo identification of repeat regions in the genome assembly with an l-mer size of l = 15 bp. Using the default settings, 4,295 distinct repeat motifs were identified that occurred ≥10 times. Annotation of these repeats was performed as described previously (10) using three different methods: (i) RepeatMasker (65) using RepBase version 19.04; (ii) TBLASTX against RepBase version 19.04; and (iii) BLASTX against a custom-made non-redundant database of proteins encoded by transposable elements (TEs; NCBI keywords: , transposase, reverse transcriptase, gypsy, copia). The best annotation among the three methods was chosen based on alignment coverage and score. Both the repeat motifs identified in this way and the set of known eukaryotic TEs from RepBase (May 2014 release) were then used to locate and annotate the repeat elements in the assembled genome using RepeatMasker (65). Repeat annotation (Table S7) is available as a track on the Aiptasia Genome Browser (http://aiptasia.reefgenomics.org/jbrowse).

From the RepeatMasker output, the timing of TE activity was computed as described previously (11). Briefly, the ages of insertions were determined for repeats ≥100 bp in length by determining the total number of divergent positions (excluding gaps) of the genomic sequence compared to the consensus sequence for that repeat family. This number was subsequently adjusted using the Jukes-Cantor formula (K = -(3/4) x ln[1 - (4/3) x i], with i the fraction of divergent positions) to account for multiple substitutions. For comparison, the same repeat annotation and analysis of TE activity were performed for the genomes of N. vectensis and A. digitifera.
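For reference, the standard Jukes-Cantor correction is K = -(3/4) ln(1 - (4/3) p), where p is the observed fraction of divergent positions. A minimal Python sketch follows, assuming that p is obtained by dividing the count of divergent positions by the number of aligned (ungapped) positions; the function name and the example counts are hypothetical.

```python
import math


def jukes_cantor_distance(divergent_positions, aligned_positions):
    """Jukes-Cantor corrected distance between a repeat copy and its consensus.

    divergent_positions: mismatches, excluding gaps.
    aligned_positions: ungapped aligned positions (assumed denominator).
    """
    if aligned_positions <= 0:
        raise ValueError("aligned_positions must be positive")
    p = divergent_positions / aligned_positions
    if p >= 0.75:
        # The correction is undefined at or beyond saturation.
        return math.inf
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


if __name__ == "__main__":
    # Hypothetical repeat copy: 10 mismatches over 100 aligned positions.
    print(jukes_cantor_distance(10, 100))
```

The corrected distance always exceeds the raw fraction p for p > 0, which is the intended effect: older insertions, where multiple substitutions at one site are more likely, are pushed further back in time.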

To relate the timing of TE activity to speciation events of Aiptasia and its sister species, the Nei-Gojobori distances (dS) of the COX3 genes from Aiptasia [Accession No. KJ482979 (66)] and the related anemones Bartholomea annulata, Verrillactis paguri, and Sagartia elegans [Accession Nos. FJ489483, FJ489503 (67), and JF833012 (68), respectively] were assessed. Protein sequences were aligned using ClustalW (69), and the corresponding nucleotide alignments were extracted using PAL2NAL (70). The dS values were then calculated using codeml from the PAML package (71).

Reference transcriptome sequencing, assembly, and annotation. To help identify protein-coding genes in the assembled Aiptasia genome, a reference transcriptome was assembled using RNA derived from different developmental states in order to maximize the diversity of expressed and assembled transcripts. First, adult strain CC7 anemones were sampled over a 30-d time course after infecting with strain SSB01 Symbiodinium cells as follows: day 1, algae were added at ~10⁵ cells/ml to wells containing aposymbiotic anemones in autoclaved and sterile-filtered seawater (AFSW); day 2, brine shrimp were added as food in the usual way (see above) without changing the AFSW or adding additional algae; day 3, the AFSW was changed and algae were again added at ~10⁵ cells/ml; day 11, the AFSW was changed without adding additional algae, and the incubation was continued until final sampling. The anemones were held throughout under the routine culture conditions described above (see 'Organisms'). Total RNA was isolated in four biological replicates from aposymbiotic, partially populated (12 days), and fully populated (30 days) anemones using TRIzol (Life Technologies #15596-026) following the manufacturer’s instructions; all samples were taken at the mid-point of the 12-h light period to control for possible circadian changes in gene expression.

Second, larvae were obtained as described previously (7) from crosses between a single CC7 anemone (male) and a single H2 anemone (female). Larvae from each of two such crosses were rinsed on 70-µm-mesh filters (BD Falcon) with sterile-filtered artificial seawater (FASW) and then split into duplicate infection and no-infection treatments (cross 1: 2 x 6,500 larvae; cross 2: 2 x 8,400 larvae). Each set of larvae was transferred to a clean glass bowl in a final volume of ~20 ml of FASW. For the infection treatments, strain SSB01 Symbiodinium cells were added twice, at 3 and 4 d post-fertilization, at ~2.5-5 x 10⁴ algal cells/ml. Both infected and uninfected larvae were then harvested at 10 d post-fertilization. For RNA extraction, larvae were rinsed with FASW on a 70-µm-mesh filter (as above), transferred to a 50-ml Falcon tube in a final volume of 7.5 ml, and centrifuged for 3 min at 4,000 rpm at 4°C. The supernatant was removed, and the larvae were resuspended in the remaining small volume of liquid before adding 400-500 µl of TRIzol and storing at -20°C. RNA was then extracted according to the manufacturer’s instructions with the inclusion of an additional cleaning step with chloroform. Total RNA was resuspended in 30 µl of RNase-free water.

mRNA was isolated from all samples using Dynabeads oligo(dT)25 (Ambion #61002). The quantity and quality of RNA were assessed using a Bioanalyzer 2100 (Agilent Technologies, RNA Nano/Pico Chip) both after total RNA extraction and after mRNA purification. Libraries were then prepared using the NEBNext Ultra Directional RNA Library Prep Kit (NEB #E7420) with 180-bp insert sizes and sequenced together on one lane of an Illumina HiSeq2000 sequencer with read lengths of 2 x 101 bp (Table S4). The sequence reads were first trimmed of sequencing-adaptor sequences and filtered to remove low-quality reads using Trimmomatic (58), and the error-correction module of ALLPATHS-LG (59) was then used to identify and remove sequencing errors. To remove read pairs that originated from Symbiodinium in the libraries prepared from symbiotic anemones and larvae, the sequences were mapped using bowtie2 (72) to a reference transcriptome that was assembled using four different genotypes of S. minutum (NCBI Project Accession No. PRJNA274852), including strain rt002, which is closely related to strain SSB01 (see above). Read pairs that aligned using default parameters were removed from the dataset before proceeding with assembly.

The remaining sequences were combined and assembled using the Trinity transcriptome-assembly tool (73) using the strand-specificity information of the paired-end reads. The default settings were used except for the minimum k-mer coverage to be included in the de Bruijn graph, which was set to 3. The final reference transcriptome contained 43,770 genes with 70,499 transcripts of ≥200 bp (both "genes" and "transcripts" as defined by the assembler); the mean transcript length was 1,142 bp (Table S5). The genes were annotated successively against the SwissProt and TrEMBL databases (74) using BLASTX (75) with an E-value cut-off of 1e-5. The final annotation contained 16,588 genes annotated by SwissProt and an additional 5,283 genes annotated by TrEMBL (but not by SwissProt). Gene ontology (GO) terms, including the ancestral GO terms, were assigned to the UniProt identifiers of annotated genes through the UniProt-GOA database (76).
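The successive annotation logic (SwissProt first, TrEMBL only as a fallback) can be sketched as follows in Python. The data layout is an assumption for illustration: each database is represented by a dictionary mapping a gene identifier to its best BLASTX hit as a `(hit_id, evalue)` tuple, and all identifiers in the usage example are made up.

```python
def annotate_successively(genes, swissprot_hits, trembl_hits, max_evalue=1e-5):
    """Assign each gene its SwissProt hit if significant, else its TrEMBL hit.

    swissprot_hits and trembl_hits map gene -> (hit_id, evalue) for the best
    hit only (hypothetical layout). Genes without a significant hit in either
    database are left out of the result.
    """
    annotation = {}
    for gene in genes:
        for source, hits in (("SwissProt", swissprot_hits),
                             ("TrEMBL", trembl_hits)):
            hit = hits.get(gene)
            if hit is not None and hit[1] <= max_evalue:
                annotation[gene] = (source, hit[0])
                break  # Do not consult TrEMBL once SwissProt has matched.
    return annotation


if __name__ == "__main__":
    # Made-up identifiers and E-values, for illustration only.
    genes = ["gene1", "gene2", "gene3"]
    swissprot = {"gene1": ("SP_A", 1e-30), "gene2": ("SP_B", 1e-2)}
    trembl = {"gene1": ("TR_X", 1e-40), "gene2": ("TR_Y", 1e-12)}
    print(annotate_successively(genes, swissprot, trembl))
```

With the counts reported above, this scheme gives 16,588 + 5,283 = 21,871 annotated genes out of 43,770.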

Development of gene models. We identified genes by ab initio prediction based on
selected gene models that were obtained by aligning the reference transcriptome to the
genome assembly and then refined further (Fig. S1C). First, the genome assembly and
all 70,499 transcripts of the reference transcriptome were used for gene-structure
annotation using PASA (77). This yielded 39,628 gene-structure models, of which
13,887 featured seemingly full-length coding regions (with both plausible start and stop
codons). To generate a high-confidence training set for ab initio gene prediction, these
complete gene models were subjected to several filters: (i) gene models with <3 exons
were excluded (to allow confident definition of intron-exon boundaries); (ii) where the
nucleotide sequences of gene models overlapped, only the longest one was retained;
(iii) where the predicted protein sequences of ≥2 gene models matched by BLASTP with
an E-value of ≤1e-10, only the longest model was retained; (iv) only gene models with
unambiguous 5' and 3' untranslated regions (UTRs), as assigned by PASA, were
retained; and (v) only gene models that lacked repeat regions, as indicated by BLASTN
similarity searches between the annotated repeats (see above) and the gene models,
were retained. These steps yielded 2,260 gene-structure models, on which Augustus
3.0. (78) was then trained, using the default Augustus training pipeline, in order to
predict genes (i.e., coding sequences plus UTRs) in the genome assembly.
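
The structural filters (i), (ii), and (iv) above can be sketched as follows. This is an illustrative sketch only: the `GeneModel` fields and function names are hypothetical, overlaps are resolved greedily from the longest model downward (one possible reading of "only the longest one was retained"), and filters (iii) and (v) are omitted because they depend on BLASTP and BLASTN results.

```python
from dataclasses import dataclass


@dataclass
class GeneModel:
    name: str
    scaffold: str
    start: int            # 1-based, inclusive
    end: int              # 1-based, inclusive
    n_exons: int
    has_both_utrs: bool   # unambiguous 5' and 3' UTRs assigned


def overlaps(a, b):
    """True if two models share at least one position on the same scaffold."""
    return a.scaffold == b.scaffold and a.start <= b.end and b.start <= a.end


def select_training_models(models, min_exons=3):
    # (i) require at least min_exons exons
    multi_exon = [m for m in models if m.n_exons >= min_exons]
    # (ii) among overlapping models keep the longest (greedy; an assumption)
    kept = []
    for m in sorted(multi_exon, key=lambda m: m.end - m.start, reverse=True):
        if not any(overlaps(m, k) for k in kept):
            kept.append(m)
    # (iv) require unambiguous UTRs on both ends
    return [m for m in kept if m.has_both_utrs]


models = [
    GeneModel("A", "s1", 1, 1000, 4, True),
    GeneModel("B", "s1", 500, 800, 5, True),     # overlaps A and is shorter
    GeneModel("C", "s1", 2000, 2500, 2, True),   # too few exons
    GeneModel("D", "s2", 1, 300, 3, False),      # lacks UTRs
    GeneModel("E", "s2", 1000, 1500, 3, True),
]
training = select_training_models(models)
```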

To refine the ab initio predictions, we also mapped all 43,770 genes from the
reference transcriptome to the genome assembly using BLAT (79) to generate "hints"
(i.e., supplementary evidence of gene presence and location). In total, 38,205 genes
from the transcriptome mapped to the genome with an identity of ≥97%. Incorporating
these hints, gene predictions by Augustus resulted in 27,998 gene models, of which
25,352 contained a predicted complete coding region. Finally, we used PASA to
compare the 39,628 transcript-based gene-structure models that it had produced (see
above) to the predicted genomic gene set from Augustus. This step added some gene
models and refined others, yielding a final set of 29,269 gene models, of which 26,658
appeared to contain the complete coding regions.

To evaluate the success of the gene-model predictions, we mapped all of the
mRNA libraries used in the reference transcriptome assembly back to the predicted
gene set and evaluated other evidence for the validity of the minority of gene predictions
that were not supported by such mRNA evidence (see Table S6).

Annotation of gene models, identification of taxonomically restricted genes, and
identification of specific protein families. The final set of predicted proteins was first
annotated using the SwissProt, TrEMBL, and NCBI nr databases, and GO terms were
assigned using a pipeline similar to that described previously (80). Briefly, BLASTP
searches of all genomic protein models were carried out against the SwissProt database
(June 2014 release). The GO terms associated with the hits were then obtained from
UniProt-GOA (July 2014 release) (76). If the best-scoring hit of the BLASTP search did
not yield any GO annotation, further hits (up to 20, or an E-value > 1e-5, whichever was
reached first) were considered, and the best-scoring hit with an available GO annotation
was used. If there was no hit in SwissProt with an associated GO term(s), the TrEMBL
database (June 2014 release) was queried using the same approach (Table S6).
Proteins that had no matches to either database were searched against the NCBI nr
database (E-value cutoff of 1e-5). Finally, the predicted proteins with no hits to any of
the three public databases were searched against the A. digitifera
(http://marinegenomics.oist.jp) and Stylophora pistillata (http://reefgenomics.org/blast/)
databases, and the proteins that still had no hits were tentatively classified as
taxonomically restricted (Aiptasia-specific).
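
The GO-term fallback described above (walk down the hit list until a hit with GO terms is found, then fall back from SwissProt to TrEMBL) can be sketched as below. The data structures and function names are hypothetical, and hits are assumed to be sorted best-first, as in typical BLAST output.

```python
def pick_go_hit(hits, go_terms, max_hits=20, max_evalue=1e-5):
    """Return the best-scoring hit that carries GO terms, or None.

    hits: list of (accession, evalue), assumed sorted best-first.
    go_terms: dict mapping accession -> list of GO identifiers.
    """
    for accession, evalue in hits[:max_hits]:
        if evalue > max_evalue:   # stop once the E-value cutoff is exceeded
            break
        if go_terms.get(accession):
            return accession
    return None


def annotate(hits_by_db, go_terms):
    """Try SwissProt first, then TrEMBL; return (db, accession, terms) or None."""
    for db in ("SwissProt", "TrEMBL"):
        accession = pick_go_hit(hits_by_db.get(db, []), go_terms)
        if accession is not None:
            return db, accession, go_terms[accession]
    return None


go = {"P2": ["GO:0005515"], "T1": ["GO:0003677"]}
hits = {
    "SwissProt": [("P1", 1e-50), ("P2", 1e-30), ("P3", 1e-2)],
    "TrEMBL": [("T1", 1e-8)],
}
result = annotate(hits, go)
```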

In addition, to obtain domain-based annotations of the predicted Aiptasia proteins,
the entire set of gene models was also annotated using PANTHER v7.0 (81) and Pfam
v27 (82). To identify ficolins and ficolin-like proteins, the Pfam annotation was searched
for proteins that had fibrinogen and collagen domains, with the former domain C-terminal
to the latter. The related proteins in other cnidarians were identified in the same way.
Likewise, Toll-like and Interleukin-like receptors were identified by searching the Pfam
annotation for TIR domains, essentially as described previously for N. vectensis and
A. digitifera (39), and NOD-like receptors were sought by searching for NACHT and NB-ARC
domains (42).
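
The domain-order criterion for ficolin-like proteins can be expressed as a short check over per-protein domain hits. This is a hedged sketch: the function name is hypothetical, the domain labels are passed in as parameters (the defaults shown are assumed labels for the collagen and fibrinogen domains), and "C-terminal" is interpreted here as a later start coordinate.

```python
def is_ficolin_like(domains, n_term="Collagen", c_term="Fibrinogen_C"):
    """True if some c_term domain starts after some n_term domain.

    domains: list of (domain_name, start, end) in protein coordinates.
    """
    n_starts = [start for name, start, _ in domains if name == n_term]
    c_starts = [start for name, start, _ in domains if name == c_term]
    return any(c > n for n in n_starts for c in c_starts)


candidate = [("Collagen", 50, 110), ("Fibrinogen_C", 120, 330)]
reversed_order = [("Fibrinogen_C", 20, 230), ("Collagen", 250, 310)]
fibrinogen_only = [("Fibrinogen_C", 20, 230)]
```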

Completeness of the gene set and diversity of molecular functions. As one
approach to assessing the completeness of the predicted genomic gene set, we
searched for the presence of 458 evolutionarily conserved "core eukaryotic genes"
(CEGs) from six model organisms (Homo sapiens, Drosophila melanogaster,
Arabidopsis thaliana, Caenorhabditis elegans, Schizosaccharomyces pombe, and
Saccharomyces cerevisiae) as proposed by CEGMA (the Core Eukaryotic Genes
Mapping Approach) (83). Instead of searching our gene models de novo by the CEGMA
pipeline, we used BLASTP (75) with an E-value cut-off of 1e-5 to search for homologues
of the 458 CEGs in the predicted Aiptasia proteome, as well as in the proteomes of
N. vectensis and A. digitifera (Fig. S2B).

In addition, to analyze the diversity of molecular functions, PANTHER and Pfam
domain counts for 23 metazoan species (Table S12) were combined in a single table
and a PCA was conducted using the R prcomp function [Component loadings %] (84).

Finally, to help determine the presence or absence of Aiptasia proteins involved in
key biosynthetic and signaling processes, we used the Kyoto Encyclopedia of Genes
and Genomes (KEGG) (85). First, the complete set of Aiptasia gene models was
annotated using the KEGG database, yielding an initial list of 13,156 genes with KEGG
annotations. Because these annotations were not always reliable (some matches were
not optimal as judged by reciprocal BLAST), we then proceeded as follows for each
pathway of interest. A list was produced of the KEGG ID numbers for each gene in the
pathway, and this list was used to search the previously generated list of KEGG
annotations. For each assignment, we asked whether it was meaningful and (where more
than one gene was assigned to a given KEGG ID number) which gene was the most
appropriate, by using BLASTP to ask whether the best hit for the Aiptasia sequence in
SwissProt (see above) was indeed a protein of the expected type (requiring an E-value
of ≤1e-5). In cases in which these steps found no Aiptasia gene corresponding to a
particular pathway component, we asked whether the apparent absence was real by using
the human (or other appropriate) sequence with that KEGG annotation to search the
predicted Aiptasia protein set using BLASTP with an E-value cut-off of 1e-5; the best hits
meeting that criterion (up to five) were then used to search SwissProt to test for matches
with an E-value cut-off of 1e-5. Several additional pathway components were found in
this way. To compare the Aiptasia gene set with those for N. vectensis and A. digitifera
for the same pathways, we used the appropriate annotated gene sets from the KEGG
database. In cases where a gene appeared to be present in Aiptasia but absent in
N. vectensis and/or A. digitifera, we asked whether this difference was real by using
BLASTP to search the N. vectensis (http://genome.jgi-psf.org/Nemve1/Nemve1.home.html) or
A. digitifera (http://marinegenomics.oist.jp) gene sets with the Aiptasia protein. In cases
where a gene appeared to be absent in Aiptasia but present in N. vectensis and/or
A. digitifera, the reliability of the latter assignments was tested by using BLASTP against
SwissProt as described above. For visualization of KEGG pathways, a reference KEGG
list for submission to the KEGG Search&Color tool
(http://www.genome.jp/kegg/tool/map_pathway2.html) is provided (Dataset S1.12).

Analyses of differential gene expression. The error-corrected paired-end mRNA
reads used as input for the reference transcriptome assembly (see above; Table S4)
were mapped to the genomic gene set using bowtie2 (72) with settings
"-a -t -X 500 --no-unal --rdg 6,5 --rfg 6,5 --score-min L,-.6,-.4 --no-discordant --no-mixed --phred33".
The mapping output was piped into SAMtools (86) for generation of a bam file and sorting.
The read-count abundances were then calculated using eXpress (87) with settings
"--rf-stranded". Count tables for "fragments per kilobase of exon per million mapped reads"
(FPKM) values and unnormalized read counts were generated using a custom perl script,
and expression levels were subsequently calculated in R (84) using the unnormalized
read counts and the package edgeR (88). Heat maps were generated using FPKM
values imported to the Multiple Experiment Viewer (MeV) and clustered using Euclidean
distances (89). MDS plots were created in edgeR using "plotMDS", where distances
between pairs of RNA samples correspond to the leading log2-fold-changes [i.e.,
average (root-mean-square) of the largest absolute log2-fold-changes] (88).
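
To make the MDS distance concrete, the sketch below computes a "leading log2-fold-change" between two samples as the root-mean-square of the largest absolute log2 differences. It is an illustration of the definition quoted above, not edgeR's code: the function name is hypothetical, the inputs are assumed to be already-normalized log2 expression values, and the default of 500 genes is an assumption.

```python
import math


def leading_log_fc(log_expr_a, log_expr_b, top=500):
    """Root-mean-square of the `top` largest absolute log2 differences."""
    diffs = sorted(
        (abs(a - b) for a, b in zip(log_expr_a, log_expr_b)), reverse=True
    )[:top]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


sample_a = [0.0, 0.0, 0.0, 0.0]
sample_b = [3.0, 4.0, 0.0, 1.0]
distance = leading_log_fc(sample_a, sample_b, top=2)
```

With `top=2`, only the two largest differences (4 and 3) contribute, so genes with small changes do not dilute the distance.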

Comparative genomic analyses. For all comparative genomic analyses, the same
genome assemblies were used as described previously (11) (Table S12).

Construction of phylogenetic tree. For construction of a phylogenetic tree, gene
families for 14 species were built using a previously described orthology method (12).
Only gene families with at least one ortholog in each species were considered, which
resulted in 952 gene families. Gene families with multiple paralogs in a single species
were reduced to one gene with the best cumulative BLASTP score to all out-group
species (to retain only the slowest evolving and/or best-predicted gene models).
Alignments were done using MUSCLE (90), and poorly aligned regions were trimmed
using GBlocks (91). The final alignment had 150,747 amino-acid positions and no gaps.
Phylogenetic inference was conducted with MrBayes (92) using the GTR model (93) with
four chains and 1,000,000 generations.
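
The paralog-reduction rule above (keep the gene with the best cumulative BLASTP score to all out-group species) amounts to a maximum over summed scores. A minimal sketch, assuming a hypothetical nested dictionary of best bit scores per gene and out-group species:

```python
def pick_representative(paralogs, outgroup_scores):
    """Return the paralog with the highest summed score across out-groups.

    outgroup_scores: {gene_id: {species: best BLASTP bit score}}; a gene
    with no recorded hits contributes a sum of zero.
    """
    return max(paralogs, key=lambda g: sum(outgroup_scores.get(g, {}).values()))


scores = {
    "gene1": {"Hsa": 100.0, "Dme": 80.0},
    "gene2": {"Hsa": 150.0, "Dme": 90.0},
    "gene3": {"Hsa": 200.0},
}
best = pick_representative(["gene1", "gene2", "gene3"], scores)
```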

Analysis of synteny and of HOX genes. Synteny was analyzed as described
previously (11), requiring blocks of ≥3 genes with ≤10 intervening genes. We compared
gene-family clustering (see above and Fig. S1A) and PANTHER orthology assignments
(v7.0 database and annotation pipeline) (81) and found that marginally more syntenic
blocks could be detected with PANTHER, which was thus used for all subsequent
synteny analyses. All synteny blocks are available as tracks on the genome browser
(http://aiptasia.reefgenomics.org/jbrowse) with species information.

Aiptasia HOX-like genes were identified by BLASTP searches of the N. vectensis
HOX genes (15) against the Aiptasia protein set (E-value cutoff 1e-5). The gene
identities were confirmed by protein alignments of the homeobox domains using
ClustalW (69).

Analyses of gene-function expansions and fast-evolving gene families. To identify
families of protein domains that are specifically expanded in the Aiptasia lineage, we
conducted Fisher's exact test (94) for PANTHER categories for genomic gene sets,
comparing in-group counts (Aiptasia) to average counts in the out-groups (all other
species in the analysis). This test was iterated over all domains, and the P-values
obtained were corrected with the Bonferroni correction (95) to identify the significantly
expanded domain families. To visualize these expansions, counts were normalized by
the total domain-family count in each species, and significantly expanded gene families
were plotted using R's heatmap.2 function (from package gplots) (84).
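
The expansion test above can be illustrated with a small standard-library implementation of Fisher's exact test followed by a Bonferroni correction. This is a sketch under assumptions: the text does not state whether the test was one- or two-sided, so a one-sided ("greater") test is shown, and the 2 x 2 table layout and function names are hypothetical.

```python
from math import comb


def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact P-value for the table [[a, b], [c, d]].

    a: in-group count for the domain family, b: all other in-group domains,
    c: out-group count for the family,      d: all other out-group domains.
    Returns P(X >= a) under the hypergeometric null.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    tail = sum(
        comb(row1, x) * comb(row2, col1 - x)
        for x in range(a, min(row1, col1) + 1)
    )
    return tail / total


def bonferroni(p_values):
    """Multiply each P-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


p = fisher_greater(3, 1, 1, 3)
adjusted = bonferroni([0.01, 0.04, 0.5])
```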

To check for Aiptasia gene families that show elevated substitution rates in their
amino-acid sequences, we used the gene-family clustering (see above) and required
Aiptasia, N. vectensis, A. digitifera, H. magnipapillata, and human (as an out-group)
sequences to be present in each cluster. This allowed assessment of 2,478 gene
families. We constructed alignments using MUSCLE (90) and ran FastTree (96)
maximum-likelihood phylogeny inferences on them using the default settings and the
generalized time-reversible (GTR) model (93). We used a stringent cutoff to identify
fast-evolving gene families in which the Aiptasia sequences have a longer branch length
relative to the root node of all other cnidarian and human sequences.

Analysis of the taxonomically restricted genes. To seek evidence for TRG function
and to ask if clusters of TRGs (see below) were expressed coordinately, we examined
their expression as described above in the different developmental and symbiotic states
used for transcriptome analysis (Table S4). To search for distant homologies of the
TRGs, we computed Hidden Markov Models (HMMs) using HMMER (http://hmmer.org)
based on the Aiptasia paralogs identified by the gene-family clustering (see above). In
total, 126 TRGs could be assigned to 54 HMMs (i.e., protein motifs). Identification of
distant hits to proteins in the UNIREF90 database (97) and to other Aiptasia proteins
was performed using hmmsearch. Two of the newly identified motifs had a distant hit to
sequences in UNIREF90 (97), and the remaining 52 (from 122 TRGs) were found in a
total of 145 other Aiptasia proteins. Finally, to ask if the TRGs were nonrandomly
organized in the genome, we compared their observed frequency of clustering to a
model of randomness generated by using a custom script to shuffle the order of all
genes in the genome 100 times.
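
The shuffling test above can be sketched as a simple permutation procedure. The custom script itself is not described here, so everything below is an assumption made for illustration: clustering is scored as the number of adjacent TRG pairs in gene order, scaffold boundaries are ignored, and the function names and seed are hypothetical.

```python
import random


def adjacent_pairs(flags):
    """Count neighbouring positions that are both flagged as TRGs."""
    return sum(1 for x, y in zip(flags, flags[1:]) if x and y)


def shuffle_test(flags, n_shuffles=100, seed=0):
    """Compare observed clustering with shuffled gene orders.

    flags: list of booleans in genome order (True = TRG).
    Returns (observed, null_values, fraction of shuffles >= observed).
    """
    rng = random.Random(seed)
    observed = adjacent_pairs(flags)
    shuffled = list(flags)
    null_values = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        null_values.append(adjacent_pairs(shuffled))
    exceed = sum(1 for v in null_values if v >= observed)
    return observed, null_values, exceed / n_shuffles


gene_order = [True, True, True] + [False] * 7
observed, null_values, fraction = shuffle_test(gene_order)
```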

Analyses of protein phylogenies. ClustalW (69) was used with default settings for
amino-acid-sequence alignments of ficolins and ficolin-like proteins and of Toll-like and
Interleukin-like receptors. In both cases, the corresponding mouse and human
sequences were downloaded from UniProt and included as the out-group. Alignment
gaps were removed with trimAl (98), and the best-fitting substitution model was identified
using prottest3 (99). Maximum-likelihood phylogenetic trees were constructed using
MEGA v6 (100) with 100 bootstrap replicates, and the final trees were visualized with
FigTree (v1.4.2, http://tree.bio.ed.ac.uk/software/figtree/).

For the phylogenies of the 3-DHQS and O-MT protein domains, the domains were
identified from Pfam annotations (see above), and the corresponding protein sequences
were trimmed to the edges of the annotated domains and aligned using ClustalW (69)
with default settings. Neighbor-joining trees were calculated in Geneious R8 (101) using
the JC model and visualized in FigTree (v1.4.2,
http://tree.bio.ed.ac.uk/software/figtree/).

For the phylogeny of the Tox-Art-HYD1 domains, we used MAFFT (102) with
default settings and included the eight Aiptasia and Symbiodinium domains together with
numerous bacterial reference sequences (50). Gaps in the domain alignment were
removed with trimAl (98), and the best-fitting amino-acid-substitution model was
computed using prottest3 (99). The maximum-likelihood tree was calculated and
visualized as described above for the ficolin-like proteins.

Genome-wide analysis of horizontal gene transfer (HGT). To search for Aiptasia
genes that may have been derived by horizontal transfer from non-metazoans, we first
constructed 12 protein-sequence databases that did not overlap among themselves.
Ten represented five metazoan (human; rodents; non-human, non-rodent mammals;
non-mammal vertebrates; and non-cnidarian invertebrates) and five non-metazoan
(plants, fungi, bacteria, archaea, and viruses) protein sets. These databases were
downloaded from UniProt, retaining their original taxonomic divisions (see
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/taxonomic_divisions/README),
and filtered for organisms with nominally complete proteomes to
reduce the search space (103). In addition, a cnidarian database was constructed by
downloading the protein sets available from the public databases (N. vectensis from
TrEMBL and H. magnipapillata from nr) and supplementing with those for A. digitifera
(http://marinegenomics.oist.jp) and S. pistillata (http://reefgenomics.org/blast). Finally, a
Symbiodinium database contained the predicted protein sets from S. minutum (clade B1)
(62) and S. microadriaticum (clade A1) (http://reefgenomics.org/blast). In total, the 12
filtered databases covered 5,570 species. We then conducted BLASTP alignments of all
predicted Aiptasia proteins to the 12 databases and saved the results in XML format for
post-processing. Only hits with E-values ≤1e-5 were retained for further analysis to
reduce spurious hits.

We then identified HGT candidates as described previously (103). Briefly, for each
gene, we calculated the HGT index (hu) as the bitscore difference between the
highest-scoring non-metazoan hit and the highest-scoring metazoan hit. As previous results
indicate that the ratio of true non-metazoan-origin hits to true metazoan-origin hits
reaches a maximum at hu = 30 and then plateaus (103), we considered genes with hu
≥30 as candidate HGT genes. To identify HGT candidates that were present in other
cnidarians as well as Aiptasia, we repeated the bitscore calculations with the cnidarian
database excluded. To determine the distributions of intron numbers in the
Aiptasia-specific and cnidarian-specific HGT candidates and compare them to the overall
distribution for Aiptasia genes, exon information was parsed from the genome-feature
file by summing up the number of exons for each predicted gene model. For the
Aiptasia-specific and cnidarian-specific genes, the 95% binomial-proportion confidence
intervals (CIs) were calculated using the formula $\mathrm{CI} = p \pm 1.96\sqrt{p(1-p)/n}$,
where p is the proportion of genes with ≤N exons and n is the total number of
Aiptasia-specific or cnidarian-specific genes.
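
The HGT index and the binomial-proportion confidence interval described above are both short calculations. The sketch below is illustrative only: the function names are hypothetical, and passing a bit score of 0 for a gene with no hit in one of the two groups is an assumption.

```python
import math


def hgt_index(best_non_metazoan_bitscore, best_metazoan_bitscore):
    """Bit-score difference between the best non-metazoan and metazoan hits."""
    return best_non_metazoan_bitscore - best_metazoan_bitscore


def is_hgt_candidate(best_non_metazoan_bitscore, best_metazoan_bitscore,
                     threshold=30.0):
    """Apply the hu >= 30 cutoff used for candidate HGT genes."""
    return hgt_index(best_non_metazoan_bitscore, best_metazoan_bitscore) >= threshold


def binomial_ci(k, n, z=1.96):
    """Normal-approximation CI for a proportion k/n: p +/- z*sqrt(p(1-p)/n)."""
    p = k / n
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return p - half_width, p + half_width


low, high = binomial_ci(50, 100)
```

For example, 50 genes out of 100 give p = 0.5 and a half-width of 1.96 x 0.05 = 0.098.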

To further test the possibility that HGTs have occurred between Symbiodinium and
Aiptasia, we constructed maximum-likelihood phylogenetic trees. BLASTP searches
were conducted against both the complete UniProt database and the cnidarian and
Symbiodinium databases (see above), and the best 1,000 hits (E-value ≤1e-5) from
each database were retained while allowing a maximum of two hits from any one
species. Sequences were then aligned using MUSCLE (90) with default parameters,
and regions of low alignment quality were trimmed using trimAl (98) with the in-built
"gappyout" parameter. Maximum-likelihood trees were then computed with 100
bootstraps using RAxML with the following parameters:
-m PROTGAMMAJTT -x 12345 -p 12345 -N 100 -f a -T 3.
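
The hit-selection rule above (best 1,000 hits per database, at most two per species) can be sketched as a single pass over a best-first hit list. The tuple layout and function name are hypothetical, and the hits are assumed to be sorted by decreasing score.

```python
from collections import Counter


def select_hits(hits, max_hits=1000, per_species=2, max_evalue=1e-5):
    """Keep the best hits while capping the number taken from each species.

    hits: list of (accession, species, evalue), assumed sorted best-first.
    """
    taken = Counter()
    kept = []
    for accession, species, evalue in hits:
        if evalue > max_evalue or taken[species] >= per_species:
            continue
        kept.append(accession)
        taken[species] += 1
        if len(kept) == max_hits:
            break
    return kept


blast_hits = [
    ("A1", "sp1", 1e-50),
    ("A2", "sp1", 1e-40),
    ("A3", "sp1", 1e-30),   # third hit from sp1 is skipped
    ("B1", "sp2", 1e-20),
    ("C1", "sp3", 1e-3),    # fails the E-value cutoff
]
selected = select_hits(blast_hits)
```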

Although we attempted to remove all alien scaffolds during genome assembly (see
above and Table S11), it seemed possible that the Tox-Art-HYD1-domain-containing
genes apparently present in Aiptasia could actually have been derived from
contaminating bacterial sequences. To address this possibility, we examined the
UniProt and nr annotations of the five genes lying upstream and the five genes lying
downstream of each Tox-Art-HYD1-domain-containing gene (or gene cluster). As
expected for bona fide Aiptasia genes, the predicted products of 35 of the 40 upstream
and downstream genes have their closest homologues in bilaterians (n = 25),
N. vectensis (n = 6), a sponge (n = 1), a cellular slime mold (n = 2), or a plant (n = 1). Of
the remaining five genes, four had no annotation, and just one had its best match to a
bacterial (Clostridium) sequence.


Supplementary Tables

Table S1: Overview of the Aiptasia genome assembly and comparison to other
published cnidarian genomes

Parameter                                       Aiptasia a   N. vectensis b   A. digitifera c   H. magnipapillata d
Genome size (Mb)                                260          329 e            420               1,300
Assembly size (Mb)                              258          356              419               852
Total contig size (Mb)                          213          297              365               785
Total contig size as % of assembly size         82.5         83.4             87.0              92.2
Contig N50 (kb)                                 14.9         19.8             10.9              9.7
Scaffold N50 (kb)                               440          472              191               92.5
Number of gene models                           29,269       27,273           23,668            31,452
Number of complete gene models f                26,658       13,343           16,434            NA g
Mean exon length (bp) h                         354          208              230               NA g
Mean intron length (bp) h                       638          800              952               NA g
Mean protein length (number of amino acids) h   517          331              424               NA g
Per cent repetitive DNA i                       ~26          26               13                57

a This study.
b Putnam et al. (8)
c Shinzato et al. (9)
d Chapman et al. (14)
e Value obtained in this study using a fluorescence-activated cell sorter and comparison
to the values obtained for Aiptasia and chicken red blood cells (see Fig. S2D-F). The
previously reported genome size was 450 Mb (8).
f Gene models with both predicted start and stop codons.
g Data not provided in the original publication (14).
h Statistics were calculated as in the Assemblathon2 study (63).
i As a per cent of the total contig size; see SI Materials and Methods, Table S7, and the
references cited for details.

Table S2: Overview of genomic libraries and of the sequences obtained a

Library type b   Insert size (bp)   Library size (Gb)   Sequencer used   Read length (bp)   Read pairs (× 10^-6)   Coverage   SRA accession number
Paired-end       180                25                  HiSeq 2000       101                123.5                  96X        SRX757328
Paired-end       550                11                  MiSeq            300                18.7                   43X        SRX757521
Mate-pair        2,000              9                   GAIIX            90                 48.0                   33X        SRX757522
Mate-pair        5,000              6                   GAIIX            90                 33.1                   23X        SRX757523
Mate-pair        10,000             3                   MiSeq            101                12.8                   10X        SRX757529
Mate-pair        12,000             3                   MiSeq            101                12.4                   10X        SRX757530

a All sequences have been deposited in the NCBI Sequence Read Archive (SRA) under
the accession numbers indicated.
b All DNA was obtained from aposymbiotic animals of clonal anemone strain CC7 (see
SI Materials and Methods).

Table S3: Assembly statistics for the Aiptasia genome assembly

Overview statistics
  Genome size                                     260,000,000 bp
  Total size of scaffolds                         258,189,365 bp
  Number of scaffolds                             5,065
  N50 scaffold length                             440,280 bp
  Total size of contigs                           212,903,857 bp
  Number of contigs                               29,410
  N50 contig length                               14,850 bp
  Mean number of contigs per scaffold             5.8
  Average length of gaps (for gaps of >25 Ns)     1,860
Scaffold statistics
  Longest scaffold                                2,343,502 bp
  Shortest scaffold                               901 bp
  Number of scaffolds >500 bp                     5,065
  Number of scaffolds >1 kb                       5,018
  Number of scaffolds >10 kb                      1,333
  Number of scaffolds >100 kb                     589
  Number of scaffolds >1 Mb                       27
  Mean scaffold size                              50,975 bp
  Median scaffold size                            2,593 bp
Contig statistics
  Longest contig                                  228,246 bp
  Number of contigs >500 bp                       28,922
  Number of contigs >1 kb                         27,452
  Number of contigs >10 kb                        6,263
  Number of contigs >100 kb                       24
  Number of contigs >1 Mb                         0
  Mean contig size                                7,239 bp
  Median contig size                              3,370 bp

Table S4: Overview of cDNA libraries and sequencing type used to generate the
reference transcriptome for gene modeling

Sample a (no. of biological replicates)      Insert size   Sequencer used   Read length   Read pairs (× 10^-6)   SRA accession number
Aposymbiotic larvae (2)                      180           HiSeq 2000       2 x 101       22.1                   SRX757531
Symbiotic larvae (2)                         180           HiSeq 2000       2 x 101       16.8                   SRX757532
Aposymbiotic adults (4)                      180           HiSeq 2000       2 x 101       46.0                   SRX757525
Symbiotic adults - partially populated (4)   180           HiSeq 2000       2 x 101       51.4                   SRX757526
Symbiotic adults (4)                         180           HiSeq 2000       2 x 101       50.2                   SRX757528

a See SI Materials and Methods for further description of samples.

Table S5: Assembly statistics for the Aiptasia reference transcriptome

Minimum k-mer coverage                                                   3
Number of read pairs used for the assembly (i.e. after preprocessing)    143,660,138
Number of genes                                                          43,770
Number of transcripts                                                    70,499
Total length of assembled sequence                                       80,542,396 bp
Maximum transcript length                                                37,202 bp
Mean transcript length                                                   1,142 bp
N50 transcript length                                                    2,024 bp
Number of annotated genes (SwissProt)                                    16,588
Number of annotated genes (TrEMBL but not SwissProt)                     5,283
Number of annotated genes (total)                                        21,871

a See SI Materials and Methods for further description of the analysis.

Table S6: Annotation statistics for the Aiptasia gene models

Total number of gene models                                                29,269      100%
Number of apparently complete gene models a                                26,658      91%
Number of gene models supported by mRNA evidence b                         27,469      94%
Number of gene models without support from mRNA evidence but with          1,665       5.7%
  other evidence that they are bona fide genes c
Number of predicted proteins with associated GO terms based on             18,876 d    65%
  SwissProt annotations
Number of predicted proteins without SwissProt annotations but with        4,117 d     14%
  associated GO terms based on TrEMBL annotations
Number of predicted proteins without SwissProt or TrEMBL annotations       2,052       7%
  but annotated against nr database
Number of predicted proteins with no annotations against SwissProt,        4,224       14%
  TrEMBL, or nr
Number of predicted proteins with no annotations against SwissProt,        1,117       3.8%
  TrEMBL, or nr that have hits (E-value ≤1e-5) in the A. digitifera
  or S. pistillata predicted protein sets e
Number of potentially species-specific genes f                             3,107       11%
Number of potentially species-specific genes found in the reference        1,946       6.6%
  transcriptome g
Number of potentially species-specific genes with ≥1 domain                113 h       0.4%
  annotations by PANTHER or Pfam
Total number of predicted proteins with PANTHER annotations                21,047      72%
Total number of predicted proteins with Pfam domains                       18,716      64%
Total number of predicted proteins with KEGG annotations                   13,156      45%

a Gene models with both predicted start and stop codons.
b Represented by ≥1 read pair in the reference transcriptome (see SI Materials and
Methods).
c 1,404 had a hit (E-value ≤1e-5) in SwissProt, TrEMBL, or nr. Of the gene models
without such a hit, 257 were seemingly complete (with both predicted start and stop
codons), and 20 of these also had a hit in PANTHER or Pfam. Finally, another four
gene models were not classified as complete but did have a hit in PANTHER or Pfam.
The identification of seemingly bona fide genes without mRNA evidence indicates that
annotation using both transcript-based and ab initio gene prediction yielded a more
comprehensive genomic gene set than would have been obtained with either method
alone.
d Of these 22,993 hits, 20,437 had E-values <1e-9, and 16,648 had E-values <1e-19,
indicating that the majority of the annotations were based on high-confidence alignments
to the SwissProt or TrEMBL database.
e The predicted protein sets for A. digitifera and S. pistillata are available at
http://marinegenomics.oist.jp and http://reefgenomics.org/blast, respectively.
f That is, gene models whose predicted products have no hits (E-value ≤1e-5) in
SwissProt, TrEMBL, nr, or the A. digitifera or S. pistillata predicted protein sets.
g Identified by BLASTN alignment of the reference transcriptome to the potentially
species-specific genes (E-value ≤1e-100).
h Of these, 68 were also among the 1,946 found in the reference transcriptome.

Table S7: Aiptasia repeat content

Element a                   Number of occurrences b   Number of bp covered b
DNA Transposon              18,478                    2,937,550
  Polinton                  1,968                     506,870
  EnSpm                     4,359                     488,942
                            1,245                     294,919
  Mariner                   2,060                     238,758
  hAT                       2,064                     234,782
  Merlin                    1,259                     201,578
  Sola                      789                       188,394
  Harbinger                 1,806                     186,924
  PiggyBac                  656                       169,934
  ISL2EU                    523                       125,458
  Kolobok                   399                       93,247
  IS4                       277                       74,196
  Crypton                   328                       70,067
  MuDR                      448                       27,782
  P                         99                        19,884
  Chapaev                   120                       11,038
                            34                        2,238
  Tc1                       12                        800
  Rehavkus                  16                        737
  Zator                     6                         376
  Charlie                   7                         283
  Novosib                   2                         185
  Tc2                       1                         158
  Tc3                       1                         66
LTR Retrotransposon         11,751                    2,409,853
  Gypsy                     5,417                     1,225,797
  BEL                       2,108                     515,343
  DIRS                      2,002                     450,280
  Copia                     2,162                     196,873
  Troyka                    30                        19,883
  ROO                       32                        1,677
Non-LTR Retrotransposon     15,549                    2,707,245
  CR1                       3,387                     755,083
  Penelope                  5,145                     688,295
  RTE                       1,420                     307,343
  L1                        984                       172,956
  Crack                     510                       106,681
  L2                        655                       94,609
  Rex                       454                       89,327
  CRE                       262                       85,140
  RTEX                      446                       79,067
  Daphne                    321                       78,030
  SINE                      669                       74,191
  Tx1                       276                       36,240
  Perere                    183                       30,515
  Nematis                   109                       25,202
  R4                        85                        25,185
  Poseidon                  223                       19,339
  Jockey                    200                       11,498
  Neptune                   15                        6,442
  I                         56                        5,202
  Hero                      17                        5,044
  R2                        27                        4,890
  R1                        28                        1,975
  Tad1                      27                        1,428
  Ingi                      15                        1,008
  Loa                       15                        990
  7SL                       2                         598
  Proto                     11                        596
  Outcast                   4                         225
  Nimbus                    1                         65
  Rtetp                     1                         42
  RandI                     1                         39
                            103                       5,491
Low complexity              34,001                    5,847,241
  Simple repeat             29,362                    4,949,952
  Other low complexity      4,639                     897,289
Satellite                   390                       28,153
Other c                                               5,817,614
Not previously described                              36,110,606 d
Total a                                               55,863,753

a Categories of elements are indicated in bold face. The repeated genes for the rRNAs,
tRNAs, and snRNAs (probably another 2-3 Mb of repeated sequences) are not included
in this tabulation.
b Bold face indicates subtotals for the respective categories.
c Previously described but poorly defined repetitive elements.
d The most abundant individual repeat motif is an almost perfectly palindromic 220-bp
sequence that covers a total of 607 kb. BLASTN search of the NCBI 'nt' database
produced a hit (E-value 1e-30) to a neurotoxin-precursor gene (Accession No.
ABW97360) from the anthozoan Actinia equina (104). The 220-bp sequence aligns over
its entire length to the single intron separating the two exons of this gene. No such
repeat is found in the genome of N. vectensis, suggesting a recent expansion of this
repeat within the Enthemonae (66).

534 Table S8: Presence or absence of genes for amino-acid-biosynthetic enzymes in 535 Aiptasia, N. vectensis, and A. digitifera 536 Amino Query N. A. acid Enzyme sequence Aiptasia a vectensis a digitifera a Cys Cystathionine β-synthase P32582 AIPGENE Nemve1| N 25097 204029 Cys Cystathionine γ-lyase P31373 AIPGENE Nemve1| adi_v1. 510 184363 09810 Met/Cys/ Bifunctional aspartokinase/ Q9SA18 N Nemve1| N Thr/Ile/Lys homoserine 224979 b dehydrogenase Met/Cys/ Aspartokinase P10869 N Nemve1| N Thr/Ile/Lys 224979 b Met/Cys/ Homoserine P31116 N Nemve1| N Thr/Ile dehydrogenase 225948 b Met/Cys Homoserine O- P08465 AIPGENE Nemve1| adi_v1. acetyltransferase 5556 224940 06190 Met/Cys Cystathionine γ-synthase c P47164 AIPGENE Nemve1| N 11428 242678 AIPGENE 11377 Met Cystathionine β-lyase c P43623 AIPGENE Nemve1| adi_v1. 11428 242678 17863 AIPGENE 11377 Arg Bifunctional Q01217 N N N acetylglutamate kinase Arg Acetylornithine P18544 N N N transaminase Arg Ornithine acetyltransferase Q04728 N N N Phe/Trp/ Pentafunctional AROM P08566 N N Tyr polypeptide Phe/Trp/ Chorismate synthase P28777 N Nemve1| N Tyr 223774 b Trp Tryptophan synthase Q42529 N Nemve1| N 152188 b His ATP Phosphoribosyl- P00498 N N N transferase His Histidine biosynthesis P00815 N N N trifunctional protein His Histidinol dehydrogenase Q9C5U8 N N N Val/Leu/ Ketol-acid P06168 N N N Ile reductoisomerase Ile/Leu 3-isopropylmalate P04173 N Nemve1| N dehydrogenase 157396 b Lys Homocitrate synthase Q12122 N N N Lys Homoaconitate hydratase P49367 N N N 537 a Systematic names as used in the corresponding gene-model databases. 30

b These gene models are likely to be derived from contaminating bacterial DNA, as they
are present in the assembly as single-exon genes at the edges of scaffolds (i.e., not
surrounded by sequences of cnidarian origin), and the predicted proteins show high
sequence similarity to bacterial proteins upon reciprocal BLAST to SwissProt (E-values
1e-13, 7e-12, 0, 0, 3e-67, 2e-44).
c The genes for these two enzymes appear difficult to distinguish on the basis of this
type of sequence analysis alone. Thus, both of the query sequences (derived from
Saccharomyces cerevisiae) aligned with approximately equivalent E-values to the same
two distinct Aiptasia gene models and to the same single N. vectensis gene model, as
shown. These genes could apparently encode either cystathionine γ-synthases or
cystathionine β-lyases, and further studies will be required to distinguish these
possibilities. Thus, the puzzle presented by our previous transcriptome study (19), in
which only AIPGENE 11428 was found and identified as a cystathionine γ-synthase
(leaving Aiptasia apparently without a cystathionine β-lyase to catalyze the presumed
next step in a biosynthetic pathway for methionine), may have been apparent rather
than real.

Table S9: Differential expression of CniFL genes in aposymbiotic and symbiotic anemones

Gene a    Systematic gene name b   Log2 fold-change c
CniFL1    AIPGENE15899             4.3
CniFL2    AIPGENE18899             6.8
CniFL3    AIPGENE4936              2.4
CniFL4    AIPGENE11648             0.0
CniFL5    AIPGENE13073             4.6
CniFL6    AIPGENE13580             1.0
CniFL7    AIPGENE19581             3.5
CniFL8    AIPGENE7322              -2.5
CniFL9    AIPGENE6313              4.2
CniFL10   AIPGENE15890             1.7
CniFL11   AIPGENE15897             6.2
CniFL12   AIPGENE27171             4.5
CniFL13   AIPGENE27771             3.0

a See Figure 4.
b Gene names as they can be accessed at the genome browser (http://aiptasia.reefgenomics.org/jbrowse).
c For aposymbiotic relative to symbiotic anemones.
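The sign convention of the fold-change column can be illustrated with a minimal sketch. It assumes hypothetical, already-normalized expression values for the two states, and the `pseudocount` argument is an illustrative guard against zeros rather than a documented parameter. The values in Table S9 come from the authors' differential-expression analysis, not from this function.

```python
import math


def log2_fold_change(apo_expr, sym_expr, pseudocount=0.0):
    """Log2 ratio of aposymbiotic to symbiotic expression.

    Positive values mean higher expression in aposymbiotic anemones,
    i.e. "aposymbiotic relative to symbiotic". The pseudocount is a
    hypothetical convenience and is not taken from the source.
    """
    numerator = apo_expr + pseudocount
    denominator = sym_expr + pseudocount
    if numerator <= 0 or denominator <= 0:
        raise ValueError("expression values must be positive after adding the pseudocount")
    return math.log2(numerator / denominator)


# Hypothetical values: 16-fold higher expression in the aposymbiotic state.
example = log2_fold_change(32.0, 2.0)
```

Under this convention a value of 0.0 (CniFL4) corresponds to equal expression in the two states, and a negative value (CniFL8, -2.5) corresponds to higher expression in symbiotic anemones.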

Table S10: Anthozoan proteins containing TIR-domains

Protein type    Aiptasia a    N. vectensis a    A. digitifera a
TLR             0             1                 4
ILR             4             4                 7
TIR only b      2             2                 13

a Shown are the numbers of genes encoding proteins of each type that were found in the currently available genomic gene sets by searching the Pfam annotations for TIR domains (see SI Materials and Methods and Poole & Weis (39) for details).
b Proteins that have a TIR domain but lack extracellular leucine-rich-repeats (like TLRs) or immunoglobulin domains (like ILRs).

Table S11: Scaffolds from alien sequences that were removed from the final assembly

Query                   % Identity    % Coverage    E-value
Scaffold1789|size5486   81.1          79.4          0 a
Scaffold2011|size4422   78.7          86.1          0 a
Scaffold2772|size2202   86.3          99.8          0 a
Scaffold4460|size1158   79.3          99.1          0 a
Scaffold5003|size1008   91.1          100           0 a

a Hit was to the draft-bacterial-genomes database (ftp://ftp.ncbi.nih.gov/genomes/Bacteria_DRAFT/).

Table S12: Genome assemblies used for comparative analyses

Abbreviation  Species                                       Source of gene models                                                         Used in analyses
Vertebrates
Hsa           Homo sapiens (human)                          NCBI 36 from Ensembl 41                                                       a, b, c, d
Mmu           Mus musculus (mouse)                          NCBI m36 from Ensembl 41                                                      c, d
Gga           Gallus gallus (chicken)                       Ensembl v55 (August 2009) on WASHUC2 assembly                                 c, d
Xtr           Xenopus tropicalis (frog)                     Annotation v4.2 on assembly v4.1                                              c, d
Dre           Danio rerio (zebrafish)                       Zv 6 from Ensembl 41                                                          a, c, d
Non-vertebrate deuterostomes
Bfl           Branchiostoma floridae (lancelet)             Brafl1 JGI annotation (April 2006)                                            a, c, d
Cin           Ciona intestinalis (ascidian)                 JGI finalized models (December 2005), REPLACES proteome 16                    c, d
Sko           Saccoglossus kovalevskii (acorn worm)         JGI Annotation of Saccoglossus kovalevskii v3                                 d
Spu           Strongylocentrotus purpuratus (sea urchin)    NCBI gene build 2 version 1 on Baylor's assembly Spur_v2.1                    c, d
Ecdysozoa
Dme           Drosophila melanogaster (fruit fly)           BDGP 4 from Ensembl 41                                                        a, c, d
Tca           Tribolium castaneum (flour beetle)            NCBI gene models build 1 v1 based on the assembly Tcas_2.0 (September 2005)   c, d
Isc           Ixodes scapularis (tick)                      v1.1 from ftp://ftp.vectorbase.org/public_data/organism_data/iscapularis/     c, d
Dpu           Daphnia pulex (water flea)                    FilteredModels8 from JGI (September 2007)                                     a, c, d
Cel           Caenorhabditis elegans (nematode worm)        Wormbase release WS164                                                        a, c, d
Lophotrochozoa
Cgi           Crassostrea gigas (Pacific oyster)            Genome Assembly from Zhang et al. (105)                                       c, d
Pfu           Pinctada fucata (pearl oyster)                Genome Assembly from Takeuchi et al. (106)                                    c, d
Lgi           Lottia gigantea (owl limpet)                  Limpet proteome from JGI (Lotgi1@shake, FilteredModels1 table) (May 2007)     a, c, d
Hro           Helobdella robusta (leech)                    FilteredModels3 from JGI (September 2007)                                     c, d
Cte           Capitella teleta (polychaete annelid worm)    FilteredModels2 track from Capca1 on shake, pre-release of v1.0 (June 2007)   a, c, d
Ava           Adineta vaga (rotifer)                        Genome Assembly from Flot et al. (107)                                        c, d
Cnidaria
Aip           Aiptasia                                      This study (http://aiptasia.reefgenomics.org)                                 a, b, c, d
Nve           Nematostella vectensis (starlet sea anemone)  JGI Annotation of Nematostella genome v1                                      a, b, c, d
Adi           Acropora digitifera (stony coral)             Assembly v1.1 from OIST                                                       a, b, c, d
Hma           Hydra magnipapillata (freshwater cnidarian)   Chosen models from PASA+Augustus+Hma1 (July 2008)                             a, b, c, d
Others
Sma           Schistosoma mansoni (trematode worm)          v080508 from ftp://ftp.sanger.ac.uk/pub/pathogens/Schistosoma/mansoni/        d
Tad           Trichoplax adhaerens (placozoan)              FilteredModels2 from JGI (August 2007)                                        a, d
Aqu           Amphimedon queenslandica (sponge)             Aqu1 models (August 2008)                                                     a, d
Sme           Schmidtea mediterranea (planarian)            Mk4 models from http://smedgd.neuro.utah.edu                                  d

a Construction of phylogenetic tree and analysis of molecular-function enrichment.
b Evaluation of fast evolving genes.
c Principal-component analysis of molecular function.
d Analysis of synteny.

Supplemental Figures

[Fig. S1 image. (A) Animal phylogenetic tree with tips grouped as Cnidaria (Anthozoa: Aiptasia, Nematostella vectensis, Acropora digitifera; Hydra magnipapillata), Lophotrochozoa, Ecdysozoa, Deuterostomia, placozoan (Trichoplax adhaerens), and sponge (Amphimedon queenslandica); scale bar 0.06. (B) cp23S and Sym15 trees of Symbiodinium strains. (C) Gene-model pipeline combining the genome assembly (5,065 scaffolds) and the transcriptome assembly (70,499 transcripts) through PASA and AUGUSTUS.]

Fig. S1. Phylogenetic positions of the organisms used and pipeline for developing gene models. (A) Animal phylogenetic tree inferred from 150,747 positions of 952 orthologous protein families (see SI Materials and Methods) shows Aiptasia strain CC7 grouped with other anthozoans. Numbers on nodes denote bootstrap values based on 1,000,000 generations and four chains; branch length (see scale bar) indicates numbers of amino-acid changes. (B) Maximum-parsimony trees of markers cp23S and Sym15 from different strains of Symbiodinium show that strain SSB01 (see SI Materials and Methods) groups with other genotypes of S. minutum (left) and not with strains of S. psygmophilum (right). Scale bars indicate numbers of nucleotide differences. +, Accession No. JX221048 (54); *, this study. Bootstrap values are based on 1,000 iterations. More details of this analysis (for strains other than SSB01) are provided by LaJeunesse et al. (56). (C) Pipeline for the development of genomic gene models. See SI Materials and Methods for details.

Fig. S2. Genome size and diversity of molecular functions. (A) Fluorescence-activated-cell-sorter profiles of propidium-iodide-stained nuclei of Aiptasia, chicken red blood cells (CRBC), and Nematostella vectensis (see SI Materials and Methods). Experiment 1 (panels 1–3) included nuclei from Aiptasia only (1), CRBC only (2), and both types of nuclei that were mixed and then stained (3). Experiment 2 (panels 4–6) was similar except that N. vectensis nuclei were used instead of CRBCs. Arrows indicate the peaks of interest. The different peak heights within an experiment (note different ordinate scales) reflect the different numbers of nuclei measured; the different positions of peak channels in the two experiments reflect differences in the staining conditions. Given the CRBC 2C DNA content of 2,279 Mb (see SI Materials and Methods), the haploid DNA contents of Aiptasia and N. vectensis are estimated to be ~260 and ~329 Mb, respectively. (B) Presence of most of the 458 "core eukaryotic genes" (83) in Aiptasia and other anthozoans (see SI Materials and Methods). Venn diagram shows the numbers of such genes present in one, two, or all three of the gene sets. (C) Principal component analysis (PCA) of molecular functions (see SI Materials and Methods). As described previously (11), the first component separates the genomes of vertebrates (light blue) from the invertebrates, including non-vertebrate deuterostomes (purple), ecdysozoans (red), lophotrochozoans (green), and cnidarians (orange).
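The arithmetic behind a reference-based genome-size estimate of this kind can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes that both peaks are 2C peaks of nuclei stained under the same conditions, so that fluorescence scales linearly with DNA content. The peak-channel values are hypothetical, because the caption does not report them; only the CRBC 2C content of 2,279 Mb is taken from the text.

```python
def haploid_genome_size_mb(sample_peak, reference_peak, reference_2c_mb):
    """Estimate haploid (1C) DNA content in Mb from two fluorescence peaks.

    Assumes both peaks are 2C peaks measured in the same staining run,
    so that peak channel is proportional to DNA content.
    """
    if sample_peak <= 0 or reference_peak <= 0:
        raise ValueError("peak channels must be positive")
    sample_2c_mb = reference_2c_mb * sample_peak / reference_peak
    return sample_2c_mb / 2.0


CRBC_2C_MB = 2279.0  # CRBC 2C DNA content stated in the caption

# Hypothetical peak channels, chosen only to illustrate the calculation.
aiptasia_1c_mb = haploid_genome_size_mb(52.0, 227.9, CRBC_2C_MB)
```

In Experiment 2 the same ratio would be taken against the Aiptasia peak instead of the CRBC peak, with the Aiptasia 2C content as the reference value.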

[Fig. S3 image. (A) Plots for Aiptasia (1), A. digitifera (2), and N. vectensis (3); ordinate, % of genome; abscissa, Jukes-Cantor distance (0 to 0.6); one curve per TE class. (B) Plots with the same axes for individual BEL, CR1, and Gypsy families. (C) Alignment of homeobox domains (MNX, ROUGH, EVX, HOXA, HOXB, HOXC, HOXD, HOXE) from Aiptasia, N. vectensis, and A. digitifera.]

Fig. S3. Genomic rearrangements and synteny. (A) Diversification of transposable elements (TEs) in Aiptasia (1), A. digitifera (2), and N. vectensis (3). For each organism, periods of TE diversification were identified by plotting the cumulative percentage of the genome covered by each of the seven most abundant TE classes in that organism (ordinate) against the Jukes-Cantor substitution distances [abscissa; calculated based on the nucleotide differences between the individual genomic TEs and the consensus sequence for the corresponding TE family (see SI Materials and Methods)]. (B) Plots as in A for the two most abundant families in the BEL (top), CR1 (middle), and Gypsy (bottom) TE classes in Aiptasia; the Gypsy and BEL families show predominantly an ancient divergence, whereas the CR1 families appear to have undergone at least two periods of divergence. (C) Alignment of the homeobox domains of HOX-like proteins of Aiptasia (this study), N. vectensis (15), and A. digitifera (16) (see SI Materials and Methods, Dataset S1.2).
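The Jukes-Cantor distance used on the abscissa corrects the observed fraction of differing sites, p, for multiple substitutions at the same site: d = -(3/4) ln(1 - (4/3) p). A minimal sketch for one TE copy aligned to its family consensus is given below. The function name and the choice to skip gap and 'N' columns are assumptions of this illustration and are not taken from the SI Materials and Methods.

```python
import math


def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor distance between two aligned nucleotide sequences.

    Columns with a gap ('-') or an ambiguous base ('N') in either
    sequence are skipped; that handling is an assumption of this sketch.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "-N" or b in "-N":
            continue
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    p = mismatches / compared
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# Hypothetical 4-bp alignment with one mismatch (p = 0.25).
d = jukes_cantor_distance("ACGT", "ACGA")
```

Small distances correspond to TE copies that still closely match their consensus, which is why peaks near zero on the abscissa are read as recent activity and peaks at larger distances as older divergence.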

[Fig. S4 image. (A) MDS plot; abscissa, leading logFC dim 1; ordinate, leading logFC dim 2; sample groups: aposymbiotic larvae, symbiotic larvae, aposymbiotic anemones, partially infected anemones, symbiotic anemones. (B) Heat map (color key from Low to High) with column groups Larvae, Apo, Partial, and Sym, and a key for cluster of origin.]

Fig. S4. Differential expression and lack of operon-like behavior of the apparently taxonomically restricted genes (TRGs). (A) MDS plot (see SI Materials and Methods) of expression levels of 1,946 putative TRGs among symbiotic and developmental states. Replicates (same color) of each developmental and symbiotic state cluster together and separate clearly along both the developmental (horizontal) and status-of-infection (vertical) axes, indicating overall changes in gene expression between the different states. (B) Expression levels (FPKM values) in different developmental and symbiotic states of putative TRGs present in clusters with ≥4 genes per cluster (see SI Materials and Methods). Each sample (see panel A) is represented by one column in the heat map, and genes originating from the same cluster are indicated by the same color in the key on the right.
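For readers unfamiliar with the unit, FPKM (fragments per kilobase of transcript per million mapped fragments) normalizes a raw fragment count by transcript length and sequencing depth. The sketch below shows only the standard definition; the argument names and example numbers are hypothetical, and the values in the figure were produced by the pipeline described in SI Materials and Methods.

```python
def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    return fragment_count * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical gene: 100 fragments on a 1,000-bp transcript
# in a library of 1,000,000 mapped fragments.
example_fpkm = fpkm(100, 1000, 1_000_000)
```

Because the value is scaled by both length and depth, columns of the heat map can be compared across samples with different sequencing depths.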

[Fig. S5 image. (A) Gene structures of NPC2A, NPC2B, NPC2C, NPC2D, NPC2E, and NPC2F with their lengths in bp. (B) Pathway diagram connecting glycolysis, the MEP/DOXP pathway, the mevalonate pathway, and the steps from squalene through lanosterol to cholesterol; color key: found in Aiptasia protein set; found in Aiptasia genome or transcriptome; found in A. digitifera protein set; found in A. digitifera genome or transcriptome; found in N. vectensis protein set; found in Symbiodinium minutum genome.]

Fig. S5. Sterol transport and biosynthesis in cnidarian-dinoflagellate symbiosis. (A) The six Aiptasia genes encoding Npc2-family sterol-transport proteins are found at three genomic loci that appear to have different source sequences based on their distinct intron-exon structures. Arrows, gene orientations; blue boxes, exons; black carets, introns; numbers, distances between start and stop codons; arrowhead, gene that is upregulated in symbiotic relative to aposymbiotic anemones (19, 21) and whose product appears to localize to the symbiosome (22). NPC2A-E were defined previously (19); NPC2F was newly identified from the genome sequence (Dataset S1.5). (B) Absence in cnidarians, but presence in Symbiodinium, of enzymes essential for the de novo synthesis of sterols (KEGG Accession No. ko00100). Results from KEGG-assisted searching (see SI Materials and Methods) of the predicted Aiptasia, N. vectensis, A. digitifera, and S. minutum protein sets are shown. As expected, cnidarians, like other animals, can apparently synthesize isoprenoids (which have a variety of biochemical functions) by the mevalonate pathway (KEGG Accession No. ko00900). In contrast, Symbiodinium can apparently use the MEP/DOXP pathway (known also from plants and apicomplexan parasites) to synthesize isoprenoids (KEGG Accession No. ko00900).

[Fig. S6 image. (A) Heat map of PANTHER families (rows) by species (columns), with a row Z-score color key from -3 to 3. (B) Phylogenetic tree of interleukin-like receptors and Toll-like receptors from Aiptasia (Aip), N. vectensis (Nve), and A. digitifera (Adi), with human and mouse IL1R1 and TLR1 sequences. (C, D) Signaling-pathway diagrams; color key: found in Aiptasia gene-model set; found in Aiptasia genome or transcriptome; found in Acropora gene-model set; found in Acropora genome or transcriptome; found in Nematostella gene-model set.]

Fig. S6. Mechanisms potentially involved in the interaction of Aiptasia with Symbiodinium and other microbes. (A) Functional-domain enrichment in the Aiptasia genome relative to 13 other metazoan genomes (Table S12) based on PANTHER-domain annotation (see SI Materials and Methods). The fibrinogen domain is highlighted. (B) Maximum-likelihood phylogenetic tree (see SI Materials and Methods) of the Toll-like-receptor (TLR) and Interleukin-like-receptor (ILR) proteins of three anthozoans (see also (39)). Species abbreviations as in panel A; gene identifiers in parentheses; numbers on nodes, bootstrap values based on 100 iterations; green and blue highlights, Aiptasia and A. digitifera, respectively. (C,D) Components of putative TLR and ILR (C) and "nucleotide-binding and oligomerization domain" (NOD)-like (D) pattern-recognition and signaling pathways identified in the Aiptasia and other anthozoan predicted gene sets by a combination of Pfam annotation and KEGG-pathway analysis (see SI Materials and Methods). Color code for organisms applies to both panels.

[Figure image. Panels A and B: phylogenetic trees whose tips are labeled with sequence identifiers (UniProt, NCBI, adi_v1, symbA, symbB, spi, and AIPGENE entries), with bootstrap values on nodes and summary charts; color key: Bacteria, Symbiodinium, Fungi, Plants, Archaea, Viruses. Panels C and D: phylogenetic trees with species names at the tips, including bacteria, cyanobacteria, dinoflagellates, fungi, and a clade labeled Cnidarians (Aiptasia, N. vectensis, A. digitifera); scale bar 0.5.]

10 20 30 40 50 E Aip-Tox1 Aiptasia LYHYTD EEGA EG IQN SGV IK ESRVY LND LPV SHVKM SQ SSSVVYKHPGD LH LDYD Aip-Tox2 Aiptasia LYHYTNKAGADG IDK SKV IKG STVY FT S IA IEW VDVTKQ DND IYXYXGDVN LN LQ Aip-Tox3 Aiptasia LYHYTT EEGARG IKRDM KVKQ NHVY FTK LD ISHV-- INDG SD IWVHQ GD ID LDY I Aip-Tox4 Aiptasia FYHYTD ERGAYG ILSSG IIK ESTVYVTD LPV SYCQ ID LLTGQ VYKH EGD IM LEYN Aip-Tox5 Aiptasia FYHYTD EKGAQ G ISESGK IK ESTVY FTD LPVAYYQ IDKTM GQ VYRHDG EVD LEYN Aip-Tox6 Aiptasia FYHYTDKQ GA EG IRK SGK IT ESSVYVTT LTV SYYXTNK SLRDVYVHQ GD IV LEYN Smic-Tox1 Symbiodinium microadriaticum FYHYTNGPG LEG ILK EGR LAV SDTYGTN LVM DRY EL--RKDQ IFVHP EG ID LKYP Smin-Tox1 Symbiodinium minutum FENHKTREGLEG ILREGK IEESLTYGTNLQMDHYHL--RRDQ ILVHPGHVDLKYP gi|183596936 Providencia stuartii --HYTSSDGYKG IMGTGSINM SDVYVTTM SSTHYK IDRQDGKRLFIQEN INLDPN gi|183597403 Providencia stuartii ------M SDVYVTTM SSTHYK IDRQDGKRLFIQEN INLDPN gi|183597449 Providencia stuartii VYHYTSSDGYKG IMGTGSINM SDVYVTTM SSTHYK IDRQDGKRLFIQEN INLDPN gi|188026570 Providencia stuartii VYHYTSSDGYKG IMGTGSINM SDVYVTTM SSTHYK IDRQDGKRLFIQEN INLDPN gi|254387031 Streptomyces sp. LYHYTN EAGYDG IIESEELRA S IQY LSD IVKRPGQ LVPW GGAR FTHY IEVDVD L I gi|270292131 Streptococcus sp. MYHYTSEKGLAGILDTGTLNPSLQYFSDIARSNASLNPWQGSKYSNYIGVDTNLT gi|298371985 Bacteroidetes oral LRHYT SNKG LQ G IEESM T IKAGDKKYK ISTTRNYR IND L-G EEYK IKRDV ELDHK gi|301308498 Bacteroides sp. VYHYTSKKSYNK ISSQSPYVFKAVYVTTKSSSHFKLDRGQ FKY IDGDVEIDRS-V gi|317056138 Ruminococcus albus LYHYTN EKGQ NA ILESN ELKP SLQ Y LSD LLY SP IQ LRP-NRYKYTH F IE ILVG LN gi|326796721 Marinomonas mediterranea VRHFTNSKGVQA IKESNM IKASDQQ LG IGRANHIEFNP ITGTERVFKGDVDLSNR gi|331672223 Escherichia coli VRHYANRKGSNA IAESG IIKAHDAKYQ IGRGRDYQ LNPRYHDELTVKGD IVLDNA gi|331682126 Escherichia coli VRHYTNRKGSNA IAESG IIKAHDAKYQ IGRGRDYQ LNPRYHDELTVKGD IVLDNA gi|336178949 Frankia symbiont MRHYTNVRGKEGILSSGVVRASDEYLG-IAGRHVRLNPTMNHEWTVTGDIAVDIK gi|345885614 Prevotella sp. 
VYHYTSKKGYN IINSQNP IKF--VYVTTKSSSHFMLDRGSFKYVEGGLEVPKK IK gi|349609192 Neisseria sp. YFHYTSEKGLSGILESNQILPSLQYLSNILSTNASLMPFQGNRFERFVAIDTGLN gi|353670284 Salmonella enterica VYHYTSKDGYNG IMGSGT IQTKDVYVTTLSSTHFKVDRQNGLRLY IEED IVLDVN gi|357206933 Salmonella enterica LRHYT SNQ G FAA IK ESM K ILAGDDK FK IKQ ARNYRVND L-G EEYK IKGD IELDGK gi|357398116 Streptomyces cattleya LYHYTT EDRAKQ IVD SQ ELSP SLQ Y LTD IAKTQ GQ LVPW AGQ K FTH F IE IDVG LD gi|358455331 Frankia sp. MRHYTNSQGVKGITSTGIIRASDDKLG-IAGRRVRLNPVMKDEWTATGDLPVNIT gi|365804660 Streptomyces cattleya LYHYTT EDRAKQ IVD SQ ELSP SLQ Y LTD IAKTQ GQ LVPW AGQ K FTH F IE IDVG LD gi|373493214 Solitalea canadensis LYHYT SEAGYNA IMDTKV LNP S IQY FTD IA FTKGQ TVPW NG SR LTH F IK IETG LN gi|374295898 Clostridium clariflavum LYHYTD EVG LNG ILN SKK LKP SLQ Y LSD IIKTPGQ LNP FQ GKR FTYY IE IDVD LN 686 gi|83646266 Hahella chejuensis LFHYTSEENLEN ILRTKKLFNSKQYMAD IS----Q LNKGNMDQ LSHY IEIDVDLI

Fig. S7. Evidence for horizontal transfer of genes into Aiptasia and other cnidarians from prokaryotes and Symbiodinium. (A) Probable taxa of origin (based on protein-alignment data; see SI Materials and Methods) of HGT candidates specific to Aiptasia (left) and to cnidaria (right). Note that the 823 proteins shown in the right-hand graph include the 275 proteins shown in the left-hand graph. Of the 29 cnidarian proteins whose best alignment is to a Symbiodinium protein (right), 17 are Aiptasia-specific (left); of the other 12, 10 are present in the N. vectensis genome and all 12 in the A. digitifera genome [in contrast to a previous report (9)]. (B) Maximum-likelihood phylogenetic tree (see SI Materials and Methods) for one of the 12 cnidarian HGT candidates of putative dinoflagellate/Symbiodinium origin that is not specific to Aiptasia (see panel A). This protein is a predicted Ca²⁺-activated chloride channel; similar results were obtained with each of the three other proteins examined in this way [i.e., DHQS-O-MT fusion protein (AIPGENE19170), 3-hydroxybutyrate dehydrogenase (AIPGENE6203), and heme oxygenase (AIPGENE14419)]. The distinct, shaded clade includes proteins from cnidarians (blue; one each from Aiptasia, A. digitifera, S. pistillata, and H. magnipapillata), Symbiodinium (green; two each from S. minutum and S. microadriaticum), and a variety of mostly unicellular eukaryotes [red; in order from the top: one from a brown alga, three from three different species of diatoms, one from a parasitic dinoflagellate, one from a foraminiferan, five from five different species of the protozoan genus Leishmania, two from two species of flatworms (platyhelminthes), one from the bacterial genus Clostridium, and two from the ciliate Tetrahymena]. (C,D) Phylogenetic trees of 3-DHQS (C) and O-MT (D) domain sequences (see SI Materials and Methods). Despite the report of a 3-DHQS-O-MT fusion protein in A. digitifera (9), we were unable to identify any such gene in the data available from www.marinegenomics.oist.jp, so no A. digitifera O-MT domain is included in panel (D). Aiptasia, A. digitifera, and N. vectensis do all contain recognizable homologues of the cyanobacterial enzymes ("ATP-grasp" and "NRPS-like") proposed to catalyze the next two steps in the synthesis of the mycosporine amino acid shinorine. Numbers on nodes, bootstrap values based on 1,000 iterations; grey, green, and blue shading, cyanobacteria, dinoflagellates, and cnidarians, respectively. (E) Alignment of Aiptasia (blue shading), Symbiodinium (green shading), and reference bacterial (grey shading) Tox-Art-HYD1 domains after gap removal. Aiptasia TOX1, 4, and 6 are the three genes identified initially as HGT candidates; TOX2, 3, and 5 were identified by their Pfam annotations (see text). The closely related TOX4-6 are also closely linked in the genome (see text). Colored shading highlights some of the conserved amino-acid positions.

Supplemental References

53. Sunagawa S, et al. (2009) Generation and analysis of transcriptomic resources for a model system on the rise: the sea anemone Aiptasia pallida and its dinoflagellate endosymbiont. BMC Genomics 10:258.

54. Xiang T, Hambleton EA, DeNofrio JC, Pringle JR, & Grossman AR (2013) Isolation of clonal axenic strains of the symbiotic dinoflagellate Symbiodinium and their growth and host specificity. J Phycol 49(3):447-458.

55. Pettay DT & LaJeunesse TC (2007) Microsatellites from clade B Symbiodinium spp. specialized for Caribbean corals in the genus Madracis. Mol Ecol Notes 7(6):1271-1274.

56. LaJeunesse TC, Parkinson JE, & Reimer JD (2012) A genetics-based description of Symbiodinium minutum sp. nov. and S. psygmophilum sp. nov. (Dinophyceae), two dinoflagellates symbiotic with cnidaria. J Phycol 48(6):1380-1391.

57. Altshuler D, et al. (2000) An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature 407(6803):513-516.

58. Bolger AM, Lohse M, & Usadel B (2014) Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30(15):2114-2120.

59. Gnerre S, et al. (2011) High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc Natl Acad Sci USA 108(4):1513-1518.

60. Boetzer M & Pirovano W (2012) Toward almost closed genomes with GapFiller. Genome Biol 13(6):R56.

61. Boetzer M, Henkel CV, Jansen HJ, Butler D, & Pirovano W (2011) Scaffolding pre-assembled contigs using SSPACE. Bioinformatics 27(4):578-579.

62. Shoguchi E, et al. (2013) Draft assembly of the Symbiodinium minutum nuclear genome reveals dinoflagellate gene structure. Curr Biol 23(15):1399-1408.

63. Bradnam KR, et al. (2013) Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. GigaScience 2(1):10.

64. Price AL, Jones NC, & Pevzner PA (2005) De novo identification of repeat families in large genomes. Bioinformatics 21 Suppl 1:i351-358.

65. Smit AFA, Hubley R, & Green PJ (1996-2010) RepeatMasker Open-3.0.

66. Rodríguez E, et al. (2014) Hidden among sea anemones: The first comprehensive phylogenetic reconstruction of the order Actiniaria (Cnidaria, Anthozoa, Hexacorallia) reveals a novel group of Hexacorals. PLoS One 9(5):e96998.

67. Gusmao LC & Daly M (2010) Evolution of sea anemones (Cnidaria: Actiniaria: Hormathiidae) symbiotic with hermit crabs. Mol Phylogenet Evol 56(3):868-877.

68. Rodríguez E, Barbeitos M, Daly M, Gusmão LC, & Häussermann V (2012) Toward a natural classification: phylogeny of acontiate sea anemones (Cnidaria, Anthozoa, Actiniaria). Cladistics 28(4):375-392.

69. Larkin MA, et al. (2007) Clustal W and Clustal X version 2.0. Bioinformatics 23(21):2947-2948.

70. Suyama M, Torrents D, & Bork P (2006) PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res 34(Web Server issue):W609-612.

71. Yang Z (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 13(5):555-556.

72. Langmead B & Salzberg SL (2012) Fast gapped-read alignment with Bowtie 2. Nat Methods 9(4):357-359.

73. Grabherr MG, et al. (2011) Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 29(7):644-652.

74. The UniProt Consortium (2014) Activities at the Universal Protein Resource (UniProt). Nucleic Acids Res 42(D1):D191-D198.

75. Altschul SF, Gish W, Miller W, Myers EW, & Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215(3):403-410.

76. Dimmer EC, et al. (2012) The UniProt-GO Annotation database in 2011. Nucleic Acids Res 40(Database issue):D565-570.

77. Haas BJ, et al. (2003) Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res 31(19):5654-5666.

78. Stanke M, Steinkamp R, Waack S, & Morgenstern B (2004) AUGUSTUS: a web server for gene finding in eukaryotes. Nucleic Acids Res 32(Web Server issue):W309-312.

79. Kent WJ (2002) BLAT--the BLAST-like alignment tool. Genome Res 12(4):656-664.

80. Liew YJ, et al. (2014) Identification of microRNAs in the coral Stylophora pistillata. PLoS One 9(3):e91101.

81. Thomas PD, et al. (2003) PANTHER: a library of protein families and subfamilies indexed by function. Genome Res 13(9):2129-2141.

82. Finn RD, et al. (2014) Pfam: the protein families database. Nucleic Acids Res 42(Database issue):D222-230.

83. Parra G, Bradnam K, & Korf I (2007) CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 23(9):1061-1067.

84. R Development Core Team (2012) R: A language and environment for statistical computing (R Foundation for Statistical Computing, Vienna, Austria).

85. Kanehisa M & Goto S (2000) KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28(1):27-30.

86. Li H, et al. (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics 25(16):2078-2079.

87. Roberts A & Pachter L (2013) Streaming fragment assignment for real-time analysis of sequencing experiments. Nat Methods 10(1):71-73.

88. Robinson MD, McCarthy DJ, & Smyth GK (2010) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26(1):139-140.

89. Howe EA, Sinha R, Schlauch D, & Quackenbush J (2011) RNA-Seq analysis in MeV. Bioinformatics 27(22):3209-3210.

90. Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32(5):1792-1797.

91. Castresana J (2000) Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol 17(4):540-552.

92. Ronquist F, et al. (2012) MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Syst Biol 61(3):539-542.

93. Tavare S (1986) Some probabilistic and statistical problems in the analysis of DNA sequences. Lectures Math Life Sci 17:57-86.

94. Fisher RA (1922) On the Interpretation of χ² from Contingency Tables, and the Calculation of P. J R Stat Soc 85(1):87-94.

95. Dunn OJ (1961) Multiple Comparisons among Means. J Am Stat Assoc 56(293):52-64.

96. Price MN, Dehal PS, & Arkin AP (2009) FastTree: computing large minimum evolution trees with profiles instead of a distance matrix. Mol Biol Evol 26(7):1641-1650.

97. Suzek BE, Huang H, McGarvey P, Mazumder R, & Wu CH (2007) UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics 23(10):1282-1288.

98. Capella-Gutierrez S, Silla-Martinez JM, & Gabaldon T (2009) trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25(15):1972-1973.

99. Darriba D, Taboada GL, Doallo R, & Posada D (2011) ProtTest 3: fast selection of best-fit models of protein evolution. Bioinformatics 27(8):1164-1165.

100. Tamura K, Stecher G, Peterson D, Filipski A, & Kumar S (2013) MEGA6: Molecular Evolutionary Genetics Analysis version 6.0. Mol Biol Evol 30(12):2725-2729.

101. Kearse M, et al. (2012) Geneious Basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics 28(12):1647-1649.

102. Katoh K & Standley DM (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 30(4):772-780.

103. Boschetti C, et al. (2012) Biochemical diversification through foreign gene expression in bdelloid rotifers. PLoS Genet 8(11):e1003035.

104. Moran Y, et al. (2008) Concerted evolution of sea anemone neurotoxin genes is revealed through analysis of the Nematostella vectensis genome. Mol Biol Evol 25(4):737-747.

105. Zhang G, et al. (2012) The oyster genome reveals stress adaptation and complexity of shell formation. Nature 490(7418):49-54.

106. Takeuchi T, et al. (2012) Draft genome of the pearl oyster Pinctada fucata: a platform for understanding bivalve biology. DNA Res 19(2):117-130.

107. Flot JF, et al. (2013) Genomic evidence for ameiotic evolution in the bdelloid rotifer Adineta vaga. Nature 500(7463):453-457.