G3: Genes|Genomes|Genetics Early Online, published on September 17, 2015 as doi:10.1534/g3.115.020164

1 De novo assembly and characterization of four anthozoan (phylum

2 Cnidaria) transcriptomes

3

4 Sheila A. Kitchen * ^

5 Email: [email protected]

6 Camerron M. Crowder ^

7 Email: [email protected]

8 Angela Z. Poole

9 Email: [email protected]

10 Virginia M. Weis

11 Email: [email protected]

12 Eli Meyer

13 Email: [email protected]

14

15 Department of Integrative Biology, Oregon State University, 3029 Cordley Hall,

16 Corvallis, OR 97331, USA

17

18 ^ Equal Contributors

19

20 Accession Numbers:

21 Raw data: NCBI’s SRA, accession SRP063463

22 Assemblies: DRYAD digital repository, doi:10.5061/dryad.3f08f

23

© The Author(s) 2013. Published by the Genetics Society of America.

24 Running Title: Four Anthozoan Transcriptomes

25

26 Keywords: , phylogenomics, non-model system, database

27

28 Corresponding Author:

29 Sheila Kitchen

30 Department of Integrative Biology

31 3029 Cordley Hall

32 Corvallis, OR 97330

33 USA

34

35 Phone: 703-673-6292

36 Email: [email protected]

37

38

39

40

41

42

43

44

45 ABSTRACT

46 Many non-model species exemplify important biological questions but lack the sequence

47 resources required to study the genes and genomic regions underlying traits of interest.

48 Reef-building corals are famously sensitive to rising seawater temperatures, motivating

49 ongoing research into their stress responses and long-term prospects in a changing

50 climate. A comprehensive understanding of these processes will require extending

51 beyond the sequenced coral genome (Acropora digitifera) to encompass diverse coral

52 species and related anthozoans. Toward that end, we have assembled and annotated

53 reference transcriptomes to develop catalogs of gene sequences for three scleractinian

54 corals (Fungia scutaria, Montastraea cavernosa, Seriatopora hystrix) and a temperate

55 anemone (Anthopleura elegantissima). High-throughput sequencing of cDNA libraries

56 produced ~20-30 million reads per sample, and de novo assembly of these reads produced

57 ~75-110 thousand transcripts from each sample with size distributions (mean ~ 1.4 kb,

58 N50~ 2 kb) comparable to the distribution of gene models from the coral genome (mean

59 ~1.7 kb, N50 ~ 2.2 kb). Each assembly includes matches for more than half the gene

60 models from A. digitifera (54-67%), and many reasonably complete transcripts (~5,300-

61 6,700) spanning nearly the entire gene (ortholog hit ratios ≥ 0.75). The catalogs of gene

62 sequences developed in this study made it possible to identify hundreds to thousands of

63 orthologs across diverse scleractinian species and related taxa. We used these sequences

64 for phylogenetic inference, recovering known relationships and demonstrating superior

65 performance over phylogenetic trees constructed using single mitochondrial loci. The

66 resources developed in this study provide gene sequences and genetic markers for several

67 anthozoan species. To enhance the utility of these resources for the research community,

68 we developed searchable databases enabling researchers to rapidly recover sequences for

69 genes of interest. Our analysis of de novo assembly quality highlights metrics that we

70 expect will be useful for evaluating the relative quality of other de novo transcriptome

71 assemblies. The identification of orthologous sequences and phylogenetic reconstruction

72 demonstrates the feasibility of these methods for clarifying the substantial uncertainties in

73 the existing scleractinian phylogeny.

74

75 INTRODUCTION

76 Transcriptome sequencing provides a rapid and cost-effective approach for gene

77 discovery in non-model organisms. Analysis of transcriptomes from a diverse range of

78 invertebrates such as sponges (Riesgo et al. 2014; Conaco et al. 2012), ctenophores (Ryan

79 et al. 2013), annelids (Riesgo et al. 2012), and molluscs (Riesgo et al. 2012; Kocot et al.

80 2011) has enhanced comparative and evolutionary studies of metazoans. Quantitative

81 analysis of these sequences (RNA-Seq) has become the method of choice to profile

82 genome-wide transcription levels. This technique provides an unbiased approach to

83 discovering functional processes through identification and quantification of

84 differentially expressed genes between phenotypic states including experimental

85 treatments (Meyer et al. 2011), tissue types (Siebert et al. 2011), and developmental

86 stages (Graveley et al. 2011).

87

88 Genomic and transcriptomic resources have been developed for a variety of species

89 within the phylum Cnidaria (Moya et al. 2012; Barshis et al. 2013; Fuchs et al. 2014;

90 Helm et al. 2013; Lehnert et al. 2012; Meyer et al. 2011; Meyer et al. 2009; Polato et al.

91 2011; Shinzato et al. 2014; Soza‐Ried et al. 2010; Traylor-Knowles et al. 2011; Wenger

92 and Galliot 2013; Sun et al. 2013; Meyer and Weis 2012; Lehnert et al. 2014), a diverse

93 group of evolutionarily and ecologically significant species that range from hydroids

94 (Class Hydrozoa) and jellyfish (Class Medusozoa) to sea anemones and corals (Class

95 Anthozoa). Cnidarians are among early-diverging or basal metazoans and occupy a key

96 position as a sister taxon to the bilaterians (Dunn et al. 2008). Many cnidarians play an

97 important role in marine trophic cascades, due to their mutualistic relationship with

98 species of the dinoflagellate genus Symbiodinium that reside inside of cnidarian host

99 cells. This relationship is based on nutritional exchange in which Symbiodinium spp.

100 provide the cnidarian host with products from photosynthesis in return for inorganic

101 nutrients and a stable, high-light environment (Davy et al. 2012). The paramount

102 examples of this partnership are the reef-building corals, which form the trophic and

103 structural foundation of productive and biodiverse coral reef ecosystems. Anthropogenic

104 stressors, especially those associated with global climate change, are gravely threatening

105 these reef ecosystems, including the corals themselves (Douglas 2003; Weis and

106 Allemand 2009). Insight into the molecular mechanisms that underlie coral-dinoflagellate

107 symbioses and their stress response to environmental perturbation is critical for future

108 management and conservation of coral reef ecosystems.

109

110 To date, there are two publicly available, sequenced genomes from the Anthozoa: the

111 symbiotic coral Acropora digitifera (Shinzato et al. 2011) and the non-symbiotic sea

112 anemone, Nematostella vectensis (Putnam et al. 2007). These genomes have provided

113 insight into the genomic complexity of cnidarians, furthering studies of gene evolution

114 and function across basal metazoans (Poole and Weis 2014; Putnam et al. 2007; Shinzato

115 et al. 2011; Marlow et al. 2009; Ryan et al. 2006; Hamada et al. 2012; Shinzato et al.

116 2012a; Shinzato et al. 2012b; Wood-Charlson and Weis 2009; Dunn et al. 2008).

117 Comparison of these genomes has revealed putative symbiosis-associated genes that may

118 function in the onset and maintenance of cnidarian-dinoflagellate symbiosis (Meyer and

119 Weis 2012). Annotated de novo transcriptomes, generated using NGS (expressed

120 sequence tags (ESTs), 454 pyrosequencing and Illumina HiSeq technologies), have been

121 published for 8 genera of anthozoans (Polato et al. 2011; Kenkel et al. 2013; Meyer et al.

122 2009; Traylor-Knowles et al. 2011; Lehnert et al. 2012; Pratlong et al. 2015; Shinzato et

123 al. 2014; Vidal-Dupiol et al. 2013). These resources have been used in variety of

124 contexts including the study of gene family evolution (Poole and Weis 2014), symbiosis-

125 enhanced gene expression (Lehnert et al. 2014) and responses to environmental stressors

126 such as elevated seawater temperature (Meyer et al. 2011; Kenkel et al. 2013), bacterial

127 infection (Closek et al. 2014), and CO2-driven changes in seawater pH (Vidal-Dupiol et

128 al. 2013). These studies are adding to earlier generation omics studies (EST studies,

129 subtractive hybridization and cDNA microarrays (Meyer and Weis 2012)) and are

130 providing information on the mechanisms of cnidarian-dinoflagellate symbiosis and coral

131 bleaching, a stress response that results from the breakdown of the partnership (Weis

132 2008; Davy et al. 2012). Expression studies are therefore contributing not only to our

133 basic understanding of cellular processes in cnidarians, but also to our ability to link

134 molecular responses with phenotypic change due to environmental perturbation.

135

136 The available anthozoan resources are limited in taxonomic diversity, and dominated by a

137 few genera from a narrow geographic range (Meyer and Weis 2012). In addition, many

138 resources are from aposymbiotic (lacking dinoflagellate symbionts) samples or non-

139 symbiotic species, which limits the study of interplay between the two partners. One goal

140 of this work is to increase the number and diversity of anthozoan resources for

141 comparative, phylogenetic and functional analyses.

142

143 In this study, we present transcriptomes from four anthozoans: the sea anemone

144 Anthopleura elegantissima (Brandt, 1835) and the corals Fungia scutaria (Lamarck,

145 1801), Montastraea cavernosa (Linnaeus, 1767), and (Dana, 1846) in

146 varying symbiotic states, life history stages and geographic locations (Table 1). These

147 species are of particular interest to investigations into the molecular mechanisms

148 associated with the onset, maintenance and breakdown of cnidarian-dinoflagellate

149 symbioses. We highlight how these transcriptomes can be used in applications ranging

150 from targeted gene searches to orthologous group predictions and phylogenomic analysis.

151 In addition, we outline a method to screen for cross-contamination between sequencing

152 libraries that can be broadly applied to other transcriptome studies.

153

154 MATERIALS AND METHODS

155 Sample collection and RNA extraction

156 All four anthozoan species examined in this study engage in symbiosis with

157 Symbiodinium spp., and therefore RNA extractions typically include contributions from

158 the dinoflagellate symbionts at some level. Here, two samples (M. cavernosa and S.

159 hystrix) were collected from symbiotic specimens and two samples (F. scutaria and A.

160 elegantissima) were collected from nominally aposymbiotic stages or specimens (Table

161 1). Larvae of F. scutaria were reared in filtered seawater at Hawaii Institute of Marine

162 Biology following fertilization and development, and remained symbiont-free during

163 development (Schnitzler and Weis 2010). The aposymbiotic specimen of A.

164 elegantissima was collected in that condition in the field.

165

166 Total RNA was extracted from S. hystrix, F. scutaria, and A. elegantissima using the

167 following methods. S. hystrix tissue was stored in RNAlater® Stabilization Solution

168 (Qiagen, CA, US) and RNA was extracted using the RNeasy Mini Kit (Qiagen, CA, US)

169 according to the manufacturer’s protocol. Whole specimens of A. elegantissima

170 (aposymbiotic) and F. scutaria (larvae) were collected, frozen in liquid nitrogen and

171 stored at -80°. RNA was extracted using a combination of the TRIzol® RNA isolation

172 protocol (Life Technologies, CA, US) and RNeasy Mini Kit (Qiagen, CA, US). The

173 TRIzol® protocol was used for initial steps up to and including the chloroform

174 extraction. Following tissue homogenization, an additional centrifugation step was

175 performed at 12,000 x g for 10 min to remove tissue debris. After the chloroform

176 extraction, the aqueous layer was combined with equal volume of 100% EtOH and the

177 RNeasy Mini Kit was used to perform washes following the manufacturer’s protocol.

178

179 A core sample of M. cavernosa was collected, frozen in liquid nitrogen and stored at

180 -80°. Total RNA from M. cavernosa was extracted following a modified TRIzol®

181 protocol with a 12 M LiCl precipitation (Mazel et al. 2003). Briefly, the coral fragment

182 was vortexed in TRIzol® reagent for 15 min and then processed according to the

183 manufacturer’s instructions through phase separation. To precipitate RNA, 0.25 ml of

184 isopropanol and 0.25 ml of a high salt solution (0.8 M sodium citrate and 1.2 M NaCl)

185 per 1 ml of TRIzol® used was added to the aqueous supernatant. The addition of high

186 salt solution removes proteoglycan and polysaccharide contaminants. The solution was

187 incubated at room temperature for 10 min and then centrifuged at 12,000 x g for 10 min

188 at 4°. After centrifugation, the standard TRIzol® protocol was followed through the

189 ethanol wash. To remove PCR inhibitors of an unknown nature that are frequently

190 encountered in coral samples, RNA was precipitated by adding an equal volume of 12 M

191 LiCl, then incubated for 30 min at -20°. The sample was centrifuged at 12,000 x g for 15

192 min at room temperature and washed with 75% EtOH (1 ml per 1 ml of TRIzol®)

193 followed by centrifugation at 7,500 x g for 5 min at room temperature. The supernatant

194 was removed and the RNA pellet was air-dried.

195

196 The extracted total RNA from each sample was DNase-treated using a TURBO DNA-

197 Free Kit (Ambion, CA, US) to remove genomic DNA contamination. RNA quantity and

198 quality were assessed using the NanoDrop ND-1000 UV-Vis Spectrophotometer (Thermo

199 Scientific, MA, US) and gel electrophoresis.

200

201 Preparation of sequencing libraries

202 Polyadenylated RNA was purified from 10 µg of total RNA using the Magnetic mRNA

203 Isolation Kit (New England Biolabs, MA, US). First strand cDNA was synthesized using

204 ProtoScript M-MuLV FS-cDNA Synthesis Kit (New England Biolabs, MA, US)

205 according to the manufacturer’s protocol and modified oligonucleotides in Table S1 in

206 File S1. Second strand synthesis was performed by incubating first-strand cDNA with 1x

207 NEBNext Second Strand Synthesis Buffer (New England Biolabs, MA, US), 0.2 mM

208 dNTPs, 15 units of E. coli DNA Ligase (New England Biolabs, MA, US), 75 units of E.

209 coli DNA polymerase I (New England Biolabs, MA, US) and 3 units of RNase H (New

210 England Biolabs, MA, US) for 2 hr at 16°. cDNA was purified using the GeneJet PCR

211 Purification Kit (Fermentas, MA, US) then fragmented using NEBNext dsDNA

212 Fragmentase (New England Biolabs, MA, US) according to the manufacturer’s protocol,

213 with the addition of 5 mM MgCl2 and 1 mg ml⁻¹ BSA (New England Biolabs, MA, US).

214 Fragmented cDNA was purified and the ends repaired using NEB Quick Blunting Kit

215 (New England Biolabs, MA, US) according to manufacturer’s protocol. The product was

216 purified and A-tailed in a reaction with nuclease-free water, 1x NEB Standard Taq Buffer

217 (New England Biolabs, MA, US), 1 mM dATP, and 2 units of NEB Standard Taq (New

218 England Biolabs, MA, US) at 68° for 2 hours. Tailed templates were ligated to double

219 stranded adaptors prepared with oligonucleotides from the Illumina Customer Sequence

220 Letter (version August 12, 2014 (Illumina 2014), Table S1 in File S1). Purified, tailed

221 cDNA was combined with T4 DNA Ligase Buffer (New England Biolabs, MA, US), T4

222 DNA Ligase (New England Biolabs, MA, US), and the double stranded adaptors and the

223 solution was incubated at 12° for at least 6 hours. Ligation products were purified then

224 amplified using custom sample-specific barcodes (“indices”) designed with a 3-bp

225 minimum Hamming distance based on Illumina barcodes (Illumina 2014) (Table S1 in

226 File S1). PCR included template cDNA, Phusion Taq polymerase buffer (Thermo

227 Scientific, MA, US), dNTPs, 5’ Illumina “i5” barcoding oligo and 3’ Illumina “i7”

228 multiplex oligonucleotide and Phusion High Fidelity Taq polymerase (Thermo Scientific,

229 MA, US). Reactions were incubated at 98° for 30 seconds, followed by 17-21 cycles of:

230 98° for 10 s, 63° for 30 s, 72° for 1.5 min. Reactions were amplified for the minimum

231 cycle number required to produce a visible product on a 1% agarose gel. PCR products

232 were size-selected by excising the 350-550 bp fraction from a 2% agarose gel. Finally,

233 size-selected sequencing libraries were extracted using the E.Z.N.A. Gel Extraction Kit

234 (Omega Bio-Tek, GA, US).
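
The barcode design constraint described in this section (a 3-bp minimum Hamming distance between sample indices) can be checked with a short script. The following is a minimal sketch in Python, not the procedure used to design the indices in Table S1; the barcode sequences and the function names `hamming` and `min_pairwise_distance` are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length barcodes."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes):
    """Smallest Hamming distance over all pairs of barcodes in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


# Hypothetical 6-bp indices for illustration only (not the Table S1 sequences).
example_barcodes = ["ATCACG", "CGATGT", "TTAGGC", "TGACCA"]

# A set satisfies the design rule if every pair differs at 3 or more positions.
meets_rule = min_pairwise_distance(example_barcodes) >= 3
```

A minimum distance of 3 means that any single sequencing error in an index read still leaves that read closer to its true barcode than to any other barcode in the set.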

235

236 Sequencing, processing and assembly

237 cDNA libraries were sequenced on Illumina HiSeq 2000 at University of Oregon’s

238 Genomics Core Facility (Eugene, OR). All cDNA libraries were pooled on a single lane

239 to produce 100 bp paired-end reads. Raw sequences were filtered using custom Perl

240 scripts to remove uninformative (matching adaptors in Table S1 in File S1, or poly-A

241 tail) and low quality reads (> 20 positions with quality scores < 20) (Meyer et al. 2011).

242 All custom scripts used in this study are available online at GitHub

243 (https://github.com/Eli-Meyer). The high-quality filtered reads were then assembled

244 using default settings in Trinity v2.0.2, a de Bruijn graph based assembler that uses

245 paired-end data to reconstruct transcripts and group these into components intended to

246 represent the collection of transcripts originating from a single gene (Grabherr et al.

247 2011).
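
The low-quality read criterion above (reads with more than 20 positions scoring below 20 are discarded) can be illustrated as follows. This is a minimal Python sketch, not the custom Perl scripts used in this study; the function names are hypothetical, and the Phred+33 quality encoding is an assumption that should be checked against the actual FASTQ files.

```python
def count_low_quality(quality_string, min_quality=20, offset=33):
    """Count positions whose Phred score is below min_quality.

    offset=33 assumes Phred+33 ASCII encoding (an assumption, not stated
    in the text).
    """
    return sum((ord(ch) - offset) < min_quality for ch in quality_string)


def passes_quality_filter(quality_string, min_quality=20,
                          max_low_positions=20, offset=33):
    """Keep a read unless more than max_low_positions bases score below
    min_quality, mirroring the "> 20 positions with quality < 20" rule."""
    low = count_low_quality(quality_string, min_quality, offset)
    return low <= max_low_positions
```

For paired-end data, one reasonable design choice (again an assumption) is to discard a pair when either mate fails, so that the two read files stay synchronized for assembly.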

248

249 Functional annotation

250 To develop these assemblies as resources for functional studies, we assigned putative

251 gene names and functional categories (Gene Ontology, GO; and Kyoto Encyclopedia of

252 Genes and Genomes; KEGG) to assembled transcripts based on sequence comparisons

253 with online databases. All sequence comparisons were conducted using BLAST+ from

254 National Center for Biotechnology Information (NCBI) (Package version 2.2.29)

255 (Altschul et al. 1990). Gene names were assigned by comparing transcript sequences

256 against UniProt protein sequence databases (SwissProt and TREMBL) using BLASTx

257 with an expect value (E value) cutoff of 10⁻⁴. Each transcript was assigned a gene name

258 based on its best match, excluding matches with uninformative names (e.g.

259 uncharacterized, unknown, or hypothetical). GO terms describing biological processes,

260 molecular functions, and cellular components were assigned to each transcript based on

261 GO-UniProt associations of its best match, downloaded from the Gene Ontology website

262 (The Gene Ontology et al. 2000). KEGG orthology terms were assigned from single-

263 directional best hit BLAST searches of each transcriptome on the KEGG Automatic

264 Annotation Server (Moriya et al. 2007).
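
The gene-naming rule described above (best match below the E value cutoff, skipping uninformative names) can be sketched as follows. This Python example is illustrative only and is not the annotation script used in this study; the tuple layout of `hits`, the function name `assign_gene_name`, and the use of bit-score to rank matches are assumptions.

```python
# Name fragments treated as uninformative, taken from the examples in the text.
UNINFORMATIVE = ("uncharacterized", "unknown", "hypothetical")


def assign_gene_name(hits, max_evalue=1e-4):
    """Return the description of the best informative match, or None.

    hits: iterable of (description, evalue, bit_score) tuples for one
    transcript (a hypothetical layout). Ranking by bit-score is an
    assumption; the text only says "best match".
    """
    informative = [
        h for h in hits
        if h[1] <= max_evalue
        and not any(word in h[0].lower() for word in UNINFORMATIVE)
    ]
    if not informative:
        return None
    return max(informative, key=lambda h: h[2])[0]
```

Under this rule a transcript whose top hit is an uncharacterized protein still receives a name if a lower-ranked hit passes the cutoff and carries an informative description.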

265

266 Reference transcriptome databases

267 The sequence data used in this study have been archived in several public repositories.

268 Raw sequence data have been deposited in the Sequence Read Archive at NCBI

269 [Accession number: SRP063463]. The annotated assemblies have been archived at the

270 Dryad Digital Repository [Accession number: doi:10.5061/dryad.3f08f].

271

272 To enhance the utility of these resources for the coral research community, we have also

273 developed searchable databases and made these publicly available on the author’s

274 laboratory website hosted at Oregon State University (Meyer). Databases were produced

275 using the open-source SQLite software library and can be queried directly using a

276 publicly accessible web form. To demonstrate the utility of our searchable databases for

277 rapidly identifying genes of interest, we searched each database for a few genes

278 previously studied in cnidarians, including a cell adhesion molecule (sym32) (Reynolds

279 et al. 2000), a cysteine biosynthesis enzyme (cystathionine β-synthase, Cbs) (Shinzato et

280 al. 2011), and a fluorescent protein (GFP) (Mazel et al. 2003; Shinzato et al. 2012b;

281 Smith-Keune and Dove 2008). For comparison with these simple text searches, we also

282 conducted a more comprehensive search for each gene based on reciprocal BLAST.

283 Representative sequences for each gene were obtained from the UniProt database

284 (version 2014_09, downloaded October 20, 2014), and searched against each assembly

285 using tBLASTn (bit-score ≥ 45). The matching transcripts were then reciprocally

286 compared against UniProt using BLASTx. Reciprocal matches were evaluated at the

287 level of gene names: transcripts identified by searching with a target gene (e.g. B5T1L4,

288 GFP from Acropora millepora) were accepted if they reciprocally matched a different

289 gene with corresponding annotation (e.g. Q9U6Y6, a GFP gene from Anemonia

290 majano).

291

292 Evaluating gene content and completeness of assembly

293 An ideal reference transcriptome would include all genes present in the genome of an

294 organism, but low or tissue-specific expression can lead to incomplete sampling of genes

295 during cDNA library preparation. To evaluate the gene representation of our assemblies,

296 we searched each assembly for sequence similarity with a core set of conserved

297 eukaryotic genes (CEGMA; (Parra et al. 2007)) and with gene models from sequenced

298 anthozoan genomes: the coral Acropora digitifera [OIST: adi_v1.0.1] (Shinzato et al.

299 2011) and the anemone Nematostella vectensis [assembly version: Nemve1] (Putnam et

300 al. 2007). Sequence comparisons were conducted using NCBI’s BLASTx Basic Local

301 Alignment Search Tool (Altschul et al. 1990), and bit-scores ≥ 50 were considered significant.

302

303 An ideal transcriptome assembly would also include complete transcripts as contiguous

304 sequences or contigs, but variation in coverage and sequence characteristics lead to

305 fragmented assemblies consisting of partial transcripts. To evaluate the effectiveness of

306 our assemblies in reconstructing complete transcripts, we calculated the Ortholog Hit

307 Ratio (OHR), a metric ranging from 0-1 that indicates the proportion of each gene

308 included in the assembled transcript (O'Neil et al. 2010). Each assembly was compared to

309 gene models from the N. vectensis genome using BLASTx to identify orthologs. We

310 calculated OHR first with a relatively stringent approach (OHR_HITS), as the proportion of

311 each N. vectensis gene included within local alignments with assembled transcripts (HSPs

312 in BLASTx output). Since this approach excludes divergent regions, we calculated OHR

313 with an alternative and more inclusive approach (OHR_ORF), as the ratio of the transcript’s

314 longest ORF (in the BLASTx-defined reading frame) relative to the length of its

315 corresponding N. vectensis protein. When multiple transcripts matched a single gene we

316 considered only the longest OHR. Distributions of maximum OHR scores and summary

317 statistics were examined to evaluate the completeness of each assembly.
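
The two OHR calculations described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the scripts used in this study: HSP coordinates are taken as 1-based inclusive positions on the reference protein, overlapping HSPs are merged so that no residue is counted twice, ratios are capped at 1, and all function names are hypothetical.

```python
def ohr_hits(hsp_intervals, protein_length):
    """Stringent OHR: fraction of the reference protein covered by the
    union of HSP intervals (1-based, inclusive subject coordinates)."""
    covered = 0
    current_end = 0
    for start, end in sorted(hsp_intervals):
        start = max(start, current_end + 1)  # skip residues already counted
        if end >= start:
            covered += end - start + 1
            current_end = end
    return min(covered / protein_length, 1.0)


def ohr_orf(orf_length_aa, protein_length):
    """Inclusive OHR: longest ORF length (in amino acids) relative to the
    reference protein length. Capping at 1.0 is an assumption."""
    return min(orf_length_aa / protein_length, 1.0)


def best_ohr_per_gene(records):
    """Keep the maximum OHR per gene when several transcripts match it.

    records: iterable of (gene_id, ohr) pairs.
    """
    best = {}
    for gene_id, ohr in records:
        best[gene_id] = max(ohr, best.get(gene_id, 0.0))
    return best
```

For example, two HSPs covering residues 1-50 and 40-80 of a 100-residue protein give a stringent OHR of 0.8, because the overlapping residues 40-50 are counted once.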

318

319 Screening for biological contamination

320 All species used in this study engage in symbiotic associations with intracellular

321 dinoflagellate symbionts and therefore RNA extracted from these specimens is expected

322 to include contributions from both animal hosts and dinoflagellate symbionts. To evaluate

323 these contributions we conducted a series of sequence comparisons aiming to identify the

324 taxonomic origin of each transcript (Figure 1). Transcripts were compared with a series

325 of sequence databases using BLAST v.2.2.29 with a bit-score threshold of 45. To identify

326 transcripts derived from rRNA, each assembly was compared with cnidarian rRNA

327 sequences using BLASTn. N. vectensis sequences were chosen for this purpose as they

328 represent the most complete cnidarian sequences in the SILVA rRNA database [SILVA:

329 ABAV01023297, ABAV01023333] (Quast et al. 2012). Transcripts were compared with

330 a cnidarian mitochondrial genome using BLASTn; for this analysis, we chose the

331 complete mitochondrial genome from Acropora tenuis [NCBI: NC_003522.1] (van

332 Oppen et al. 2002). To identify the taxonomic origin of each transcript, sequences were

333 compared with the NCBI non-redundant (nr) protein database (downloaded March 12,

334 2014) using BLASTx (E value ≤ 10⁻⁵) (Altschul et al. 1990). To avoid errors that might

335 arise from the scarcity of cnidarian and dinoflagellate sequences in these databases,

336 transcripts were compared with gene models from Symbiodinium minutum (clade B)

337 [OIST: symbB.v1.2.augustus.prot] and A. digitifera [OIST: adi_v1.0.1_prot] using

338 BLASTx. The taxonomic origin of each sequence was categorized as follows. First,

339 transcripts matching rRNA or mitochondrial sequences were assigned to those categories.

340 Transcripts matching Symbiodinium genes more closely than coral genes, that did not

341 return a metazoan hit as their best match in nr, were assigned to the dinoflagellate

342 category. Transcripts matching coral genes more closely than Symbiodinium genes, that

343 also matched metazoan records or lacked matches in nr, were categorized as metazoan.

344 Transcripts that showed conflicting results (metazoan in one db but non-metazoan in the

345 other) were categorized as “unknown”. Transcripts lacking any match to either coral or

346 Symbiodinium genes were assigned based on taxonomic annotation of the best match in

347 nr, if available. This series of decisions made it possible to classify each transcript based

348 on origin (ribosomal, mitochondrial, other metazoan genes, dinoflagellate, or “other

349 taxa”, which includes prokaryotes, “uncertain”, or “no match”).
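
The decision series described above can be written as a single function. The sketch below is one possible Python reading of those rules, not the pipeline used in this study. The function name, the argument layout, and the taxon labels are hypothetical, and three details are assumptions: rRNA is checked before mitochondrial matches, a missing coral or Symbiodinium match is scored as 0, and ties are treated as conflicting.

```python
def classify_transcript(rrna_hit, mito_hit, coral_score, symbiont_score, nr_taxon):
    """Assign a transcript to an origin category.

    rrna_hit, mito_hit: True if the transcript matched those references.
    coral_score, symbiont_score: best bit-scores against A. digitifera and
        S. minutum gene models, or None if there was no match.
    nr_taxon: "metazoa", "dinoflagellate", "other", or None (no nr match).
    """
    if rrna_hit:
        return "ribosomal"
    if mito_hit:
        return "mitochondrial"
    if coral_score is None and symbiont_score is None:
        # No match to either gene set: fall back on the best nr match.
        if nr_taxon is None:
            return "no match"
        fallback = {"metazoa": "metazoan", "dinoflagellate": "dinoflagellate"}
        return fallback.get(nr_taxon, "other taxa")
    coral = coral_score or 0.0
    symbiont = symbiont_score or 0.0
    if symbiont > coral and nr_taxon != "metazoa":
        return "dinoflagellate"
    if coral > symbiont and nr_taxon in ("metazoa", None):
        return "metazoan"
    return "unknown"  # the two lines of evidence conflict
```

Writing the rules this way makes the asymmetry explicit: a transcript that is closer to coral genes is accepted as metazoan when nr is silent, whereas any non-metazoan nr match for that transcript places it in the "unknown" category.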

350

351 Screening for cross-contamination

352 During preliminary analysis of the transcriptome assemblies, we observed a few

353 orthologs with unexpectedly high sequence similarity (> 99%) among species.

354 Since cross-contamination could realistically occur at several different stages during

355 multiplex library preparation and sequencing, we tested for evidence of cross-

356 contamination in our transcriptome assemblies and developed a pipeline to eliminate

357 contaminating sequences. To evaluate the extent of cross-contamination in our libraries,

358 we mapped the cleaned reads used to produce each assembly against that assembly using

359 the Trinity utility align_and_estimate_abundance.pl (Haas et al. 2013). We then

360 compared all transcriptome libraries sequenced and prepared together using BLASTn to

361 identify nearly identical sequences present in multiple assemblies (bit-score ≥ 100). This

362 analysis identified many sequences occurring in multiple assemblies, which were highly

363 abundant in one sample (consistent with this being their true origin) but very low

364 abundance (≥ 10-fold lower) in other assemblies (consistent with cross-contamination).

365 To evaluate the level of sequence similarity expected among anthozoan transcriptomes,

366 for comparison with the similarity observed among our assemblies, we compared

367 publicly available transcript assemblies produced independently in different labs

368 (Pocillopora damicornis, (Traylor-Knowles et al. 2011); A. digitifera, (Shinzato et al.

369 2011); and A. millepora, (Meyer et al. 2009)). To eliminate putative cross-contaminants

370 identified in our assemblies, we first compared assemblies using BLASTn to identify

371 highly similar sequences (bit-score ≥ 100). We then estimated the abundance of each

372 transcript in each assembly by mapping and counting reads from each library against the

373 assembly produced from those reads, using the Trinity utility

374 align_and_estimate_abundance.pl. To identify and remove sequences that might result

375 from cross-contamination, we categorized each transcript based on sequence similarity

376 and relative expression in all other assemblies. Any transcripts with nearly-identical

377 matches in more than one assembly were assigned to the assembly in which each was

378 most abundant, if the sequence was at least 10-fold more abundant in that library than any

379 others. Alternatively, transcripts found at comparable levels (< 10-fold difference) in

380 multiple assemblies were flagged as “unknown origin” and excluded from further

381 analysis.
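
The 10-fold assignment rule described above can be sketched as follows. This is a minimal Python illustration rather than the pipeline used in this study; the function name `assign_origin` and the dictionary input are hypothetical, and the abundance values are assumed to be in any single consistent unit taken from the read-mapping step.

```python
def assign_origin(abundance_by_assembly, min_fold=10.0):
    """Decide which assembly a shared, nearly identical sequence belongs to.

    abundance_by_assembly: dict mapping assembly name -> abundance of the
    matching transcript in that assembly's own library.
    Returns the assembly name if the sequence is at least min_fold more
    abundant there than in every other assembly, else "unknown origin".
    """
    ranked = sorted(abundance_by_assembly.items(),
                    key=lambda item: item[1], reverse=True)
    top_name, top_value = ranked[0]
    if top_value <= 0:
        return "unknown origin"  # no expression evidence at all
    if len(ranked) == 1:
        return top_name
    second_value = ranked[1][1]
    if top_value >= min_fold * second_value:
        return top_name
    return "unknown origin"
```

Comparing the top abundance only with the runner-up is sufficient, because a sequence that is 10-fold more abundant than the second-ranked assembly is also at least 10-fold more abundant than all lower-ranked ones.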

382

383 Development of SSR markers

384 Simple sequence repeats (SSRs), also known as microsatellites, are sequences composed of

385 tandemly repeated motifs of 2-5 base pairs. These molecular markers have been widely used for

386 studies of genome mapping, genetic linkage and population structure. Although SSRs

387 have largely been replaced with sequencing-based approaches for single nucleotide

388 polymorphism (SNP) genotyping, in some situations they may still be the most practical

389 option. To demonstrate the utility of transcriptome assemblies for SSR marker

390 development and identify SSR markers for the four species described here, we used a

391 pipeline we have previously described for identifying SSRs in coral sequence data

392 (Davies et al. 2013). In brief, sequences containing repetitive regions (≥ 30 bp, ≤ 15%

393 deviation from perfect repeat structure, ≥ 30 bp flanking regions) were identified using

394 RepeatMasker (Smit et al. 1996-2010), and then assembled using CAP3 to eliminate

395 redundancy (Huang and Madan 1999). Target sequences were further screened for

396 redundancy using BLASTn (Altschul et al. 1990) to identify unique targets within each

397 repeat type (e.g., AT, CCG, etc.). Finally, primer sequences flanking these SSRs were

398 developed using Primer3 (Rozen and Skaletsky 1999), targeting regions 150-500 bp with

399 45-65% GC content.
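
The length and flanking criteria above can be illustrated with a simple regular-expression search. This Python sketch is a stand-in for illustration only and does not reproduce RepeatMasker or the published pipeline: it detects perfect repeats only (stricter than the ≤ 15% deviation allowed above), reports the shortest motif that tiles at each position, and uses a hypothetical function name `find_ssrs`.

```python
import re

# A 2-5 bp motif followed by one or more exact tandem copies of itself.
SSR_PATTERN = re.compile(r"([ACGT]{2,5}?)\1+")


def find_ssrs(sequence, min_repeat_bp=30, min_flank_bp=30):
    """Return (motif, start, end) for perfect SSRs that are long enough and
    have enough flanking sequence on both sides (0-based, end-exclusive)."""
    seq = sequence.upper()
    results = []
    for match in SSR_PATTERN.finditer(seq):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # skip homopolymer runs such as AAAA...
        repeat_length = match.end() - match.start()
        left_flank = match.start()
        right_flank = len(seq) - match.end()
        if (repeat_length >= min_repeat_bp
                and left_flank >= min_flank_bp
                and right_flank >= min_flank_bp):
            results.append((motif, match.start(), match.end()))
    return results
```

The flank requirement matters for marker development because primers must be placed in unique sequence on both sides of the repeat.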

400

401 Identification of orthologous groups

402 To facilitate comparative studies of cnidarian gene sequences, and demonstrate the utility

403 of our transcriptome assemblies for phylogenetic analysis, we identified orthologous

404 groups among the four transcriptomes generated in this study. We also compared these

405 with sequence resources from other cnidarians and basal metazoans, including a marine

406 sponge Amphimedon queenslandica (Srivastava et al. 2010), the hydrozoan Hydra

407 magnipapillata (Chapman et al. 2010), the scyphozoan Aurelia aurita (Fuchs et al.

408 2014), and a variety of other anthozoans including Aiptasia pallida (Lehnert et al. 2012),

409 N. vectensis (Putnam et al. 2007), A. digitifera (Shinzato et al. 2011), Porites astreoides

410 (Kenkel et al. 2013), P. damicornis (Vidal-Dupiol et al. 2013), Stylophora pistillata

411 (Karako-Lampert et al. 2014), Orbicella faveolata (formerly belonging to the genus

412 Montastraea (Budd and Stolarski 2011; DeSalvo et al. 2008)), and Pseudodiploria

413 strigosa (Table S2 in File S1). These resources varied in the types of sequencing

414 technologies used to create them and this resulted in differing degrees of assembly

415 completeness, ranging from whole genomes to EST libraries (Table S2 in File S1). All

416 resources were converted into candidate protein coding sequences using the package

417 TransDecoder (transdecoder.sourceforge.net) that identifies open reading frames. Protein

418 sequences were then processed with FastOrtho (enews.patricbrc.org/fastortho), an

419 OrthoMCL based program (Li et al. 2003) that performs an all-by-all BLAST of the input

420 sequences (E value cutoff ≤ 10⁻⁵) and clusters orthologous groups with the Markov

421 Cluster algorithm (Van Dongen 2000).

422

423 Phylogenetic analysis

424 The four transcriptomes from this study and other sequence resources were used to infer

425 phylogenetic relationships from commonly used markers and newly identified orthologs.

426 The mitochondrial gene cytochrome c oxidase 1 (COI) has been used to reconstruct the

427 most comprehensive phylogeny of corals (Anthozoa, Scleractinia) (Kitahara et al. 2010)

428 and mitochondrial sequences are commonly used to infer evolutionary relationships of

429 the Cnidaria (Kitahara et al. 2010; Bridge et al. 1992; Kayal et al. 2013). Recent findings

430 suggest, however, that a concatenated set of NADH dehydrogenase genes (ND 2, 4 and 5)

431 outperforms COI in metazoan datasets including in anthozoans (Havird and Santos 2014).

432

433 To investigate the effect of increased gene sampling on phylogenetic inferences, we

434 compared phylogenetic trees constructed based on (a) the widely-used marker COI, (b)

435 the ND supergene, and (c) the set of orthologs identified from a comparison of our

436 transcriptomes with other cnidarian sequence resources. All taxa used in searches for

437 orthologous groups were included and A. queenslandica served as the outgroup. The

438 TransDecoder catalog of proteins for each organism was made into a local BLAST

439 database. Then, the mitochondrial protein sequences of COI, ND2, ND4 and ND5 were

440 found from BLASTx searches against our local databases, UniProt or NCBI databases

441 (Tables S3 and S4 in File S1). In some cases, mitochondrial genes were not recovered

442 from the local protein databases, but were found by tBLASTx searches of the original resources.

443 These transcripts were instead translated using Expasy Translate Tool

444 (http://web.expasy.org/translate/) under the “invertebrate mitochondrial” genetic code.

445 Protein sequences for COI, ND2, ND4 and ND5 were aligned using MAFFT v6.864b

446 (Katoh et al. 2002). In some cases, the mitochondrial sequences were fragmented within a

447 single database or recovered from two separate databases (Tables S3 and S4 in File S1).

448 These fragments were aligned and manually combined to increase total alignment

449 positions. Individual MAFFT alignments of ND2, ND4 and ND5 were concatenated into

450 a single matrix in Mesquite (v. 3.02) (Maddison and Maddison 2011). Protein alignments

451 of COI and the ND genes were run through ProtTest server

452 (http://darwin.uvigo.es/software/prottest_server.html) (Abascal et al. 2005) to select the

453 appropriate substitution rate model based on the AIC and BIC criteria. Phylogenetic trees

454 were constructed using maximum likelihood (ML) in RAxML v. 8.0.26 (Stamatakis

455 2014) under the MTZOA+G+F model (Rota-Stabelli et al. 2009). Optimal topology was

456 selected based on ML scores from 500 replicate trees. Nodal support was assessed from

457 500 bootstrap replicates.
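
For readers unfamiliar with the two criteria used in model selection, the following Python sketch shows how AIC and BIC are computed from a model's log-likelihood and ranks candidate models by each. It is an illustration only: the model names and log-likelihood values are placeholders, and using the number of alignment positions as the sample size for BIC is a common convention assumed here, not a detail taken from this study.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 lnL."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion: k ln(n) - 2 lnL.
    Assumption: n is taken as the number of alignment positions."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


def select_model(fits, n_sites):
    """fits maps a model name to (lnL, number of free parameters).
    Returns the names of the best model under AIC and under BIC."""
    scores = {
        name: (aic(lnl, k), bic(lnl, k, n_sites))
        for name, (lnl, k) in fits.items()
    }
    best_aic = min(scores, key=lambda name: scores[name][0])
    best_bic = min(scores, key=lambda name: scores[name][1])
    return best_aic, best_bic


# Placeholder fits, not results from this study.
example_fits = {"model_A": (-1000.0, 10), "model_B": (-990.0, 30)}
example_choice = select_model(example_fits, n_sites=100)
```

Lower values are better under both criteria; BIC penalizes additional parameters more heavily than AIC once the alignment exceeds a handful of positions.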

458

459 For phylogenomic reconstruction, the computational pipeline PhyloTreePruner (Kocot et

460 al. 2013) was applied to orthologous groups with a minimum amino acid length of 100

461 from the 15 taxa identified in Table S2 in File S1. PhyloTreePruner is a phylogenetic

462 approach used to refine orthologous groups identified in programs like OrthoMCL by

463 removing predicted paralogs resulting from gene duplication or splice variants through

464 single gene-tree evaluation (Kocot et al. 2013). First, each group of orthologs was aligned

465 using MAFFT v. 6.864b with 1000 iterations. Ambiguous or uninformative positions

466 were removed from the alignment using Gblocks v. 0.91b (Castresana 2000). Then,

467 single-gene ML trees for each group inferred with FastTree2 (Price et al. 2010) were

468 screened for paralogy with PhyloTreePruner and the longest sequence for each taxon was

469 retained. The pruned orthologous groups were then merged into a single matrix using

470 FASconCAT v. 1.0 (Kück and Meusemann 2010). To examine the impact of missing data

471 on tree topology, two trees were constructed. In the conservative tree, 14-15 taxa were

472 sampled per ortholog for a total of 397 groups (73,833 unique alignment positions). The

473 relaxed tree allowed more missing data, requiring only at least 10 taxa sampled per

474 ortholog for a total of 2,896 groups (535,413 unique alignment positions). For each

475 dataset, ML trees were inferred with RAxML v. 8.0.26 using the WAG+GAMMA+F

476 substitution model (Whelan and Goldman 2001). Topology for each tree was selected

477 from 100 replicate trees, and nodal support values are based on 100 and 500 bootstrap

478 replicates in the conservative and relaxed trees respectively.
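
The difference between the conservative and relaxed datasets comes down to a taxon-occupancy threshold applied before concatenation. The Python sketch below illustrates that idea on toy alignments. It is not the FASconCAT implementation; the group identifiers, taxon labels, and the use of "-" to pad taxa missing from a group are assumptions made for illustration.

```python
def filter_by_occupancy(groups, min_taxa):
    """Keep orthologous groups sampled in at least min_taxa taxa.
    groups maps a group ID to a dict of taxon -> aligned sequence."""
    return {gid: aln for gid, aln in groups.items() if len(aln) >= min_taxa}


def concatenate(groups, taxa, missing="-"):
    """Build a supermatrix; a taxon absent from a group receives a run of
    the `missing` symbol as wide as that group's alignment."""
    matrix = {taxon: [] for taxon in taxa}
    for gid in sorted(groups):
        aln = groups[gid]
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"group {gid} is not aligned")
        width = lengths.pop()
        for taxon in taxa:
            matrix[taxon].append(aln.get(taxon, missing * width))
    return {taxon: "".join(parts) for taxon, parts in matrix.items()}


# Toy data with three hypothetical taxa.
toy_groups = {
    "OG1": {"A": "MK-L", "B": "MKTL", "C": "MKSL"},
    "OG2": {"A": "GG", "B": "GA"},
    "OG3": {"A": "W"},
}
toy_taxa = ["A", "B", "C"]
strict = filter_by_occupancy(toy_groups, min_taxa=3)
relaxed = filter_by_occupancy(toy_groups, min_taxa=2)
relaxed_matrix = concatenate(relaxed, toy_taxa)
```

Lowering the threshold adds alignment positions at the cost of more padded cells, which is the trade-off the two trees are meant to probe.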

479

480 RESULTS AND DISCUSSION

481 Sequencing and de novo assembly

482 The four libraries described here were sequenced on Illumina HiSeq 2000 (each

483 occupying 1/6th of a lane), yielding on average 26.3 million paired reads per library

484 (range: 21.2-30.3, Table S5 in File S1). A fraction of these (22% on average; range 14-

485 28%) were removed during quality and adaptor filtering prior to assembly. Assembly of

486 the remaining high-quality reads produced on average ~170,000 transcripts. This is

487 substantially higher than the number of genes in sequenced cnidarian genomes (23,677 in

488 A. digitifera, 27,273 in N. vectensis), which likely results from redundancy,

489 fragmentation in the assemblies and biological contamination. Assemblies included many

490 small contigs (on average, 47% were < 400 bp) that were unlikely to provide significant

491 matches, so for analyses based on sequence homology we considered only contigs ≥ 400

492 bp (average n=91,792). For these core transcriptome datasets used for downstream

493 analyses, the average length ranged from 1.1-1.7 kb and N50 ranged from 1.4-2.7 kb.

494 These are slightly shorter than the expected size distribution for a complete cnidarian

495 transcriptome (e.g. average ~ 1,700 and N50 ~ 2,200 bp transcripts in the A. digitifera

496 genome), suggesting incomplete assemblies. Assembly statistics of the four transcriptome

497 references developed in this study are broadly comparable to previously published

498 anthozoan transcriptomes (Moya et al. 2012; Shinzato et al. 2014; Shinzato et al. 2011;

499 Abascal et al. 2005; Traylor-Knowles et al. 2011; Lehnert et al. 2012).
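
As a reminder of how the summary statistics above are defined, the following Python sketch computes N50 (the contig length at which contigs of that length or longer contain at least half of the assembled bases) and applies the ≥ 400 bp length filter. The contig lengths in the example are invented for illustration and the function names are hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contig lengths given")


def core_contigs(lengths, min_length=400):
    """Keep contigs at or above the minimum length used for homology analyses."""
    return [length for length in lengths if length >= min_length]


# Invented contig lengths, not data from the assemblies.
example_lengths = [150, 300, 450, 800, 1200, 2500]
example_core = core_contigs(example_lengths)
example_n50 = n50(example_core)
```

Because short contigs contribute few bases, removing them changes the mean length far more than it changes N50.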

500

501 Completeness of transcriptomes

502 To evaluate the completeness of the transcriptome assemblies from the perspective of

503 gene content, we conducted sequence comparisons with conserved eukaryotic genes and

504 gene models from sequenced relatives. The core eukaryotic genes (CEGMA; (Parra et al.

505 2007)) are expected to be expressed in most eukaryotes (Nakasugi et al. 2013; Sanders et

506 al. 2014) and are widely used to estimate transcriptome completeness. Sequence

507 comparisons revealed matches for 453 of these conserved genes (98.9%) in A.

508 elegantissima and 456 (99.5%) in F. scutaria, M. cavernosa and S. hystrix (Figure 2a).

509 For a more comprehensive view of gene representation, the transcriptomes were

510 compared with gene models from sequenced relatives (the coral A. digitifera and the

511 anemone N. vectensis). This analysis identified matches for more than 14,000 gene

512 models in each genome (BLASTx, bit-score ≥ 50): 54-67% of gene models in A.

513 digitifera (Figure 2b) and 48-49% in N. vectensis. This is comparable to the level of

514 sequence similarity observed among anthozoans with completed genomes. BLASTp

515 comparisons of predicted proteins from the A. digitifera and N. vectensis genomes using

516 the same thresholds recover 35% and 42% of genes in the other genome. This is

517 substantially lower than the optimistic estimates of representation based on CEGMA,

518 perhaps reflecting essential functions and constitutive expression of these highly

519 conserved genes. Comparisons with gene models of closely related taxa appear to provide

520 a more conservative estimate of gene representation in transcriptome assemblies.

521

522 To evaluate the effectiveness of our assemblies in reconstructing complete transcripts, we

523 calculated ortholog hit ratios (OHR) for each final assembly. This method estimates the

524 amount of a de novo transcript contained in the best ortholog from a reference genome

525 (O'Neil et al. 2010), ranging from 1 (for complete transcripts) to 0 (for transcript

526 fragments). We calculated OHR based on sequence comparisons with N. vectensis gene

527 models, using two approaches. First, a relatively stringent analysis based on the

528 proportion of each N. vectensis gene included in regions of local similarity (OHR_HITS)

529 produced median OHR of 63.8, 64.7, 65.7, and 58.0% for A. elegantissima, F. scutaria,

530 M. cavernosa and S. hystrix, respectively (Figure 2c). A more inclusive analysis based on

531 the longest ORF (in BLAST-defined frame) produced similar estimates (median OHR_ORF:

532 67.4, 75.8, 77.2, and 60.3% respectively). Each assembly included more than 5,000

533 reasonably complete transcripts spanning at least 75% of the corresponding N. vectensis

534 gene (range: 5,262-6,725). Overall, these comparisons with existing cnidarian sequence

535 resources quantify the representation and completeness of our assemblies, and provide a

536 framework for comparison with other de novo assemblies. These estimates compare

537 favorably with previous transcriptome completeness estimates for cnidarians (Sanders et

538 al. 2014) and several invertebrates (O'Neil and Emrich 2013; Riesgo et al. 2012) using

539 similar methods.
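
A hit-based ortholog hit ratio of the kind described above can be sketched in Python as the fraction of a reference protein covered by the union of local alignment regions. This is a minimal illustration under stated assumptions: coordinates are 1-based and inclusive on the reference protein, overlapping or adjacent regions are merged before counting, and the intervals and lengths shown are invented. The exact procedure used in the study may differ in detail.

```python
from statistics import median


def merge_intervals(intervals):
    """Merge overlapping or adjacent 1-based inclusive intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def ortholog_hit_ratio(hit_intervals, reference_length):
    """Fraction of the reference protein covered by aligned regions."""
    covered = sum(end - start + 1 for start, end in merge_intervals(hit_intervals))
    return covered / reference_length


# Invented alignments of three transcripts to 200-residue reference proteins.
example_ratios = [
    ortholog_hit_ratio([(1, 100), (51, 150)], 200),
    ortholog_hit_ratio([(1, 50), (101, 150)], 200),
    ortholog_hit_ratio([(1, 200)], 200),
]
example_median = median(example_ratios)
n_reasonably_complete = sum(ratio >= 0.75 for ratio in example_ratios)
```

Merging intervals first avoids counting the same reference residues twice when a transcript produces several overlapping local alignments.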

540

541 Annotation of transcriptomes

542 Transcripts were annotated using BLAST homology searches against the UniProt

543 databases. Approximately a third of all transcripts matched records in UniProt (range: 30-

544 40%) (Table S5 in File S1). The relatively low fraction of sequences annotated is

545 attributable in part to sequence lengths: on average, 21% of transcripts <400 bp in length

546 were annotated, as compared with 42% of transcripts 400-1000 bp in length and 78% of

547 transcripts > 1,000 bp. Even among the longest transcripts (> 1 kb), a substantial number

548 of sequences lacked annotated matches in UniProt (range: 6,647-12,090 sequences per

549 assembly). This highlights the well-known bias in taxonomic composition of existing

550 databases, and the value of ongoing gene sequencing in under-represented metazoan taxa

551 for public sequence databases.

552

553 To categorize the biological functions inferred from sequence similarity, Gene Ontology

554 (GO) terms were assigned to transcripts matching GO-annotated records in the UniProt

555 database. This process identified functional annotation for 77% of transcripts with

556 BLAST matches, providing tentative gene identities for a large number of sequences in

557 each assembly (range: 32,299- 47,547 transcripts; Table S5 in File S1). Figure 3 shows

558 the distribution of functional categories across the four transcriptomes, visualized using

559 the Web Gene Ontology Annotation Plotting (WEGO) application. The GO terms were

560 broadly distributed across the three domains and the percentages of sequences mapped to

561 a given sub-ontology were highly similar for all species, and comparable to other

562 invertebrate transcriptomes (Riesgo et al. 2012; O'Neil et al. 2010; Lehnert et al. 2012;

563 Moya et al. 2012; Polato et al. 2011; Shinzato et al. 2014; Stefanik et al. 2014; Traylor-

564 Knowles et al. 2011). The similarities in functional distributions of assemblies prepared

565 from diverse species, developmental stages, and symbiotic states highlight the

566 constitutive expression of a broad set of genes in cnidarian transcriptomes. These core

567 genes should facilitate comparative transcriptome studies by increasing the overlap

568 among incomplete libraries.

569

570 To determine taxonomic origin for each transcript, we conducted a series of BLAST

571 searches and filtering steps outlined in Figure 1. Since our assemblies were produced

572 from symbiotic and aposymbiotic specimens, the transcriptomes contain genes not only

573 from anthozoans but also from their associated microbial community. To investigate the

574 relative contributions of these sources we classified each transcript based on sequence

575 similarity (Figure 1). These analyses confirmed that metazoan sequences comprised the

576 majority of each library as expected. Fortunately, only a small fraction of transcripts were

577 derived from organelles (mitochondria and ribosomes): on average, 212 transcripts

578 (range: 127-284) in each assembly matched rRNA (N. vectensis) and 30 transcripts

579 (range: 16-54) matched the mitochondrial genome (A. tenuis). Notably, almost half of

580 transcripts in each assembly (range: 46.2% to 49.9%) lacked matches to coral or

581 Symbiodinium spp. genes, or NCBI’s nr database (Figure 4), a range that is consistent

582 with results from other anthozoan transcriptomes (Sun et al. 2013; Karako-Lampert et al.

583 2014; Polato et al. 2011; Traylor-Knowles et al. 2011). These ‘unknown’ transcripts may

584 represent lineage-specific genes (‘taxonomically-restricted genes’) that require further

585 characterization. Comparison with NCBI’s nr database revealed that the majority of

586 sequences with matches in one or more databases (59-95%) matched a metazoan

587 sequence better than any other taxon, suggesting they originated from the animal host

588 rather than from dinoflagellate or prokaryotic symbionts. A negligible fraction of

589 transcripts in each assembly (0.8-1.7%) were assigned to the “Other taxa” category, most

590 of which matched either coral or Symbiodinium genes but were classified as “unknown”

591 because of conflicting results in the nr search (e.g. transcripts that matched Symbiodinium

592 more closely than coral, but whose best matches in nr were from metazoans).

593

594 The contribution of algal symbionts varied widely across samples. In nominally

595 aposymbiotic samples of F. scutaria and A. elegantissima, 2.6% of transcripts on average

596 were classified as dinoflagellate in origin (Figure 4), which may have resulted either from

597 unexpected presence of symbionts at low abundance in these samples, or genes lacking

598 orthologs in the A. digitifera reference. The symbiotic samples from S. hystrix, in

599 contrast, showed comparable abundance of transcripts classified as metazoan (61,369)

600 and dinoflagellate in origin (41,724). Surprisingly, the M. cavernosa library that was

601 similarly prepared from a symbiotic sample showed only 7,278 transcripts from

602 Symbiodinium (Figure 4). This striking contrast in Symbiodinium contributions from

603 symbiotic specimens may have arisen from differing methods of RNA extraction. For S.

604 hystrix, tissue was airbrushed off the coral skeleton directly into RNAlater® Stabilization

605 Solution (Qiagen, CA, US) followed by complete tissue homogenization. In contrast, the

606 M. cavernosa fragment was simply vortexed to disrupt tissue, without physical

607 homogenization. Our findings suggest that omitting physical homogenization during lysis

608 can minimize symbiont contamination for studies aiming to focus on the cnidarian host,

609 while studies investigating both components may benefit from thorough homogenization

610 during extraction. The gene names, functional categories, and putative origin of each

611 transcript are annotated in Tables S6-9 in Files S2-5.

612

613 Gene searches of the database

614 The resulting annotations and sequences are available in a set of searchable databases

615 hosted by Oregon State University (Meyer). To illustrate the utility of databases for

616 cnidarian researchers targeting specific genes, we compared the effectiveness of simple

617 text searches of the databases with reciprocal BLAST (RB) analysis, a more

618 comprehensive approach that requires additional work by the end-user. Text searches

619 targeting a handful of selected genes (cell adhesion molecule sym32, green fluorescent

620 protein GFP, and cystathionine β-synthase Cbs) produced results comparable to RB

621 searches (Table S10 in File S1). Text searches are obviously sensitive to query phrasing;

622 the query “fluorescent” retrieves 51 putative GFP homologs, and functionally related

623 synonyms (“GFP”, “chromoprotein”) retrieved an additional 10. Interestingly, the Cbs

624 homologs identified in nominally aposymbiotic samples (A. elegantissima and F.

625 scutaria) showed greater sequence similarity with Symbiodinium gene models than coral

626 (A. digitifera) and were classified as dinoflagellate in our assignment procedure (Figure

627 1), while Cbs homologs in symbiotic samples (M. cavernosa and S. hystrix) included both

628 metazoan and dinoflagellate transcripts. This unexpected observation of apparently

629 dinoflagellate homologs of Cbs in nominally aposymbiotic samples is noteworthy

630 because of their variable distribution among corals and possible roles in coral nutritional

631 dependency on symbiosis (Shinzato et al. 2011). While this finding could be explained by

632 undetected Symbiodinium harbored in these putatively aposymbiotic samples, the

633 uncertainty introduced by these observations suggests that studies investigating the

634 diversity of Cbs homologs across corals may require additional data (e.g. in-situ

635 hybridization) to confirm transcript origins. Overall, the close agreement between

636 rigorous computational searches and simple text searches in these examples illustrates the

637 utility of our searchable online databases for rapidly identifying genes of interest in

638 reference transcriptome assemblies.

639

640 Novel SSR markers

641 Simple sequence repeats (SSRs, or microsatellites) have been widely used to study

642 genetic diversity, hybridization events, population structure and connectivity in

643 anthozoans (Concepcion et al. 2010; Fernandez-Silva et al. 2013; Selkoe and Toonen

644 2006; Ruiz-Ramos and Baums 2014), and can directly influence phenotypic traits by

645 altering DNA replication, translation and gene expression (Ruiz-Ramos and Baums

646 2014). SSR markers can be readily identified from de novo assemblies of NGS data, and

647 emerge as a side benefit in transcriptome assembly projects conducted for other purposes.

648 We identified and developed primers for 52, 49, 73 and 75 candidate SSR markers in A.

649 elegantissima, F. scutaria, M. cavernosa and S. hystrix, respectively. Primer pairs for each

650 species are listed in Table S11 in File S6. For three of the species studied here, varying

651 numbers of SSR markers are already available. Previous studies of S. hystrix have

652 developed 10 SSR markers (Maier et al. 2001; Underwood et al. 2006) to study habitat

653 partitioning within a single reef (Bongaerts et al. 2010), dispersal and recruitment

654 patterns across multiple reefs (van Oppen et al. 2008; Kininmonth et al. 2010), and

655 population changes associated with bleaching events (Underwood et al. 2007). Candidate

656 SSR markers have been identified in F. scutaria (n=118) from the coral host and

657 dinoflagellate symbionts (Concepcion et al. 2010). SSR markers previously developed in

658 M. cavernosa (Shearer et al. 2005; Serrano et al. 2014) have been used to investigate the

659 population connectivity across depth and geographic distance (Serrano et al. 2014). The

660 candidate SSR markers identified in this study provide additional markers for future

661 studies along similar lines. To our knowledge, SSR markers have not been previously

662 developed in A. elegantissima. Although the population structure of the host has not been

663 described, analysis of their dinoflagellate symbionts revealed highly structured

664 populations across their geographic range (Sanders and Palumbi 2011). The markers

665 developed in this study for A. elegantissima provide tools to investigate population

666 structure of the host across a similar range.
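
To illustrate how candidate SSRs can be located in assembled transcripts, the Python sketch below scans a sequence for perfect di-, tri- and tetranucleotide repeats with a regular expression. It is not the procedure used to generate the markers reported here: the minimum repeat counts are hypothetical thresholds, the example sequence is invented, and primer design is not covered.

```python
import re

# Hypothetical minimum repeat counts per motif length (assumption).
DEFAULT_MIN_REPEATS = {2: 6, 3: 5, 4: 4}


def is_compound(unit):
    """True if the motif is itself a repeat of a shorter motif (e.g. AA, ATAT)."""
    n = len(unit)
    return any(n % d == 0 and unit == unit[:d] * (n // d) for d in range(1, n))


def find_ssrs(sequence, min_repeats=None):
    """Return (start, motif, repeat_count) for each perfect repeat found.
    Start positions are 0-based; motifs are searched shortest first."""
    thresholds = DEFAULT_MIN_REPEATS if min_repeats is None else min_repeats
    seq = sequence.upper()
    found = []
    for unit_len, min_n in sorted(thresholds.items()):
        # One captured motif followed by at least (min_n - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_n - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            if is_compound(unit):
                continue
            found.append((match.start(), unit, len(match.group(0)) // unit_len))
    return found


# Invented transcript fragment containing (AC)6 and (CAG)5.
example_seq = "GG" + "AC" * 6 + "TT" + "CAG" * 5 + "T"
example_ssrs = find_ssrs(example_seq)
```

A left-to-right scan reports each repeat in the first reading frame that satisfies the threshold, so the same locus may be described by a rotated motif (for example GCA instead of CAG) depending on its flanking bases.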

667

668 Orthologous groups and phylogenomic reconstructions

669 With the increasing availability of transcriptomes and genomes, these datasets can now

670 be mined to discover novel phylogenetic markers within Anthozoa and across the

671 Cnidaria to resolve taxonomic uncertainties. Phylogenetic reconstruction of anthozoans

672 has presented challenges because analyses based on morphology, life history, and

673 molecular sequences have failed to adequately delineate taxonomic boundaries or

674 evolutionary relationships (Daly et al. 2003). To date, molecular phylogenies for

675 anthozoans have been based on one or a small number of markers including nuclear

676 ribosomal 28S and 18S genes (Daly et al. 2003; Berntson et al. 1999), β-tubulin (Fukami

677 et al. 2008), mitochondrial 16S (Daly et al. 2003), cytochrome b (Fukami et al. 2008),

678 and COI (Kitahara et al. 2010; Fukami et al. 2008). Interestingly, mitochondrial

679 sequences in anthozoans have extremely low mutation rates compared to the bilaterians

680 and are therefore highly conserved, allowing for robust comparisons across distantly

681 related taxa (van Oppen et al. 2002; Galtier et al. 2009). Therefore, the mitochondrial

682 gene COI has been used recently to define evolutionary relationships among scleractinian

683 corals (Kitahara et al. 2010; Fukami et al. 2008; Budd and Stolarski 2011), and to support

684 the distinction of robust corals from the complex corals (Romano and Palumbi 1996).

685

686 One disadvantage to single gene phylogenetic inferences is that they suffer from weak

687 phylogenetic signals, sensitivity to hidden paralogy, and spurious tree artifacts (Philippe

688 et al. 2004). Despite these potential limitations, single gene trees have advanced the field

689 of cnidarian systematics. However, polyphyly remains a problem amongst several

690 anthozoan families when using both maximum likelihood and Bayesian analyses (Fukami

691 et al. 2008; Budd and Stolarski 2011), which has led to recent shifts in taxonomic

692 classification (Budd and Stolarski 2011). To expand beyond previous single-gene

693 approaches, we performed phylogenomic analyses incorporating the four new

694 transcriptomes and other available ‘omic’ resources. By simultaneously increasing taxon

695 and gene sampling, phylogenetic inference is expected to improve (Philippe et al. 2004)

696 and may help resolve some of the challenges in reconstructing the evolutionary

697 relationships of the Anthozoa and more broadly, phylum Cnidaria.

698

699 For phylogenomic analysis, transcripts larger than 400 bp were converted to protein with

700 TransDecoder and clustered into orthologous groups using FastOrtho. The number of

701 assigned orthologous groups ranged from 14,144 to 21,147 for the four transcriptomes

702 (Figure S1 in File S1). Comparison of all four resulted in 6,560 shared orthologs (Figure

703 S1 in File S1). The three coral species shared 2,045 orthologs not found in anemones and

704 the two most closely related corals (M. cavernosa and F. scutaria) shared 1,682 orthologs

705 absent from the other assemblies. By incorporating 11 additional taxa for phylogenomic

706 analysis (Table S2 in File S1), 443 orthologs were identified between all taxa. After

707 setting a minimum protein length (100 amino acids), these orthologs were refined using

708 the PhyloTreePruner analysis pipeline (Kocot et al. 2013). Filtering resulted in the

709 identification of 397 orthologs for ≥ 14 taxa. These were used to construct a

710 phylogenetic tree we termed “conservative” because loci with any missing data were

711 excluded (Table S12 in File S7).
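
The counts of shared and lineage-restricted orthologous groups reported above amount to set intersections and differences over group membership. A minimal Python sketch follows, using invented group identifiers and hypothetical species labels; it illustrates the bookkeeping only and does not reproduce the reported numbers.

```python
def shared_groups(membership, include, exclude=()):
    """Groups present in every species in `include` and in none in `exclude`.
    membership maps a species label to the set of group IDs it belongs to."""
    shared = set.intersection(*(membership[species] for species in include))
    for species in exclude:
        shared -= membership[species]
    return shared


# Invented membership table for four assemblies.
example_membership = {
    "Ae": {1, 2, 3},
    "Fs": {1, 2, 4, 5},
    "Mc": {1, 2, 4, 5},
    "Sh": {1, 2, 4},
}
all_four = shared_groups(example_membership, ["Ae", "Fs", "Mc", "Sh"])
corals_only = shared_groups(example_membership, ["Fs", "Mc", "Sh"], exclude=["Ae"])
fs_mc_only = shared_groups(example_membership, ["Fs", "Mc"], exclude=["Ae", "Sh"])
```

The same function extends to any number of taxa, so adding further sequence resources only requires new entries in the membership table.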

712

713 Missing data are a commonly encountered problem in phylogenomic analyses, either

714 from reduced transcript length or gene absence from a transcriptome (Philippe et

715 al. 2004; Kocot et al. 2013; Roure et al. 2013). However, the sensitivity of phylogenetic

716 inference to incomplete datasets is still under investigation, with mixed results from

717 phylogenomic analyses on large, but patchy supermatrices (Roure et al. 2013; Philippe et

718 al. 2004). Since the resources in this study used for ortholog identification differed in

719 completeness, ranging from EST libraries to complete genomes (Table S2 in File S1), we

720 tested the influence of missing data on our phylogenetic reconstruction. To investigate

721 this, we lowered the required number of taxa per orthologous group to ≥ 10, which

722 identified 2,897 orthologs (Table S12 in File S7). This second set was used to create the

723 “relaxed” phylogeny, so called because loci with some missing data were included.

724

725 Both maximum likelihood phylogenomic analyses reconstructed identical and strongly

726 supported topologies (bootstrap = 100; Figure S2 in File S1), demonstrating that our

727 phylogenetic inference was insensitive to missing data (Figure 5). However, the

728 relationship of the corals in the family Faviidae, containing M. cavernosa, P. strigosa and

729 O. faveolata varied among the COI, ND supergene and phylogenomic analyses. The

730 mitochondrial ND supergene identified by Havird and Santos (2014)

731 produced a phylogenetic tree nearly congruent with the accepted cnidarian taxonomic

732 relationships and phylogenomic analyses from this study (Kitahara et al. 2010), except

733 for the placement of M. cavernosa as sister taxon to O. faveolata and P. strigosa. The

734 analysis of the single gene COI resulted in a discordant phylogenetic topology (Figure 5),

735 failing to reconstruct the complex coral clade (P. astreoides and A. digitifera), which was

736 recovered by the ND supergene, relaxed and conservative trees (Figure 5, Figure S2 in File S1).

737 In the COI tree, the placement of F. scutaria, from the family Fungiidae, as sister

738 taxon to P. strigosa and M. cavernosa from the family Faviidae, instead of O. faveolata is

739 incongruent with current taxonomic placement (Figure 5) (Kitahara et al. 2010; Budd and

740 Stolarski 2011). Furthermore, while the phylogenomic analyses placed O. faveolata as

741 sister to P. strigosa with strong support (bootstrap=100), this relationship was not

742 recovered in either mitochondrial phylogeny (Figure S2 in File S1). Overall, the tree

743 topology from the phylogenomic analyses is consistent with accepted evolutionary

744 relationships within Anthozoa (Budd and Stolarski 2011; Fukami et al. 2008; Kitahara et

745 al. 2010).

746

747 CONCLUSION

748 The annotated transcriptome assemblies developed in this study provide useful resources

749 for genomic research in anthozoan species for which sequence resources were

750 previously lacking. The searchable databases developed from these assemblies make it

751 possible to rapidly identify genes of interest from each species. Our ortholog analysis

752 demonstrates the feasibility of phylogenetic inference in corals using transcriptome

753 assemblies from diverse stages and symbiotic states, highlighting a promising path

754 toward resolving major uncertainties in the existing phylogeny of scleractinians. Future

755 studies will benefit from the growing body of anthozoan sequence resources, including

756 the four assemblies contributed in this study.

757

758 AVAILABILITY OF SUPPORTING DATA

759 The data sets supporting the results of this article are available from the Sequence Read

760 Archive at NCBI [Accession number: SRP063463], the Dryad Digital Repository

761 [doi:10.5061/dryad.3f08f], and the author’s website

762 [http://people.oregonstate.edu/~meyere/index.html].

763

764 ABBREVIATIONS

765 BLAST: Basic Local Alignment Search Tool; Cbs: cystathionine β-synthase; CEGMA:

766 Core Eukaryotic Genes Mapping Approach; COI: cytochrome oxidase subunit 1; EST:

767 expressed sequence tag; E value: expect value; GO: gene ontology; GFP: green

768 fluorescent protein; mtDNA: mitochondrial DNA; ML: maximum likelihood; NCBI:

769 National Center for Biotechnology Information; ND: NADH dehydrogenase; OHR:

770 ortholog hit ratio; NGS: next-generation sequencing; nr: non-redundant; RB: reciprocal

771 BLAST; rRNA: ribosomal RNA; RNA-Seq: RNA sequencing; SNP: single nucleotide

772 polymorphism; SSR: simple sequence repeats

773

774 COMPETING INTEREST

775 The authors declare no competing interests.

776

777 AUTHORS’ CONTRIBUTIONS

778 EM, SK, and AP conceived the investigation. SK, CC and AP performed library

779 preparation and sequencing. CC, AP, and EM assembled and annotated the

780 transcriptomes. SK performed computational analyses related to transcriptome

781 completeness, GO annotation, and phylogenetics. CC compiled transcriptome statistics.

782 CC and SK performed targeted gene searches. EM performed cross contamination

783 screens, identified SSR markers and provided bioinformatic expertise. SK, CC and EM

784 made significant contributions to the preparation of the manuscript. All authors revised

785 and approved the final manuscript.

786

787 ACKNOWLEDGEMENTS

788 Research funding was provided by Oregon State University, Department of Integrative

789 Biology. Publication of this article in an open access journal was funded by the Oregon

790 State University Libraries & Press Open Access Fund. We would like to acknowledge

791 Dr. Christine Schnitzler, and the labs of Dr. Tung-Yung Fan and Dr. Andrew Baker for

792 assistance with sample collection. In addition, we would like to thank Sarah Guermond

793 and Emily Weiss for assistance in sample preparation and analysis.

794

795 References

796

797 Abascal, F., R. Zardoya, and D. Posada, 2005 ProtTest: selection of best-fit models of

798 protein evolution. Bioinformatics 21 (9): 2104-2105.

799 Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, 1990 Basic local

800 alignment search tool. J. Mol. Biol. 215 (3): 403-410.

801 Barshis, D. J., J. T. Ladner, T. A. Oliver, F. O. Seneca, N. Traylor-Knowles et al., 2013

802 Genomic basis for coral resilience to climate change. Proc. Natl. Acad. Sci. U. S.

803 A. 110 (4): 1387-1392.

804 Berntson, E. A., S. C. France, and L. S. Mullineaux, 1999 Phylogenetic relationships

805 within the class Anthozoa (phylum Cnidaria) based on nuclear 18S rDNA

806 sequences. Mol. Phylogenet. Evol. 13 (2): 417-433.

807 Bongaerts, P., C. Riginos, T. Ridgway, E. M. Sampayo, M. J. H. van Oppen et al., 2010

808 Genetic Divergence across Habitats in the Widespread Coral Seriatopora hystrix

809 and Its Associated Symbiodinium. PLoS One 5 (5): e10871.

810 Bridge, D., C. W. Cunningham, B. Schierwater, R. DeSalle, and L. W. Buss, 1992 Class-

811 level relationships in the phylum Cnidaria: evidence from mitochondrial genome

812 structure. Proc. Natl. Acad. Sci. U. S. A. 89 (18): 8750-8753.

813 Budd, A. F., and J. Stolarski, 2011 Corallite wall and septal microstructure in

814 scleractinian reef corals: comparison of molecular clades within the family

815 Faviidae. J. Morphol. 272 (1): 66-88.

816 Castresana, J., 2000 Selection of conserved blocks from multiple alignments for their use

817 in phylogenetic analysis. Mol. Biol. Evol. 17 (4): 540-552.

818 Chapman, J. A., E. F. Kirkness, O. Simakov, S. E. Hampson, T. Mitros et al., 2010 The

819 dynamic genome of Hydra. Nature 464 (7288): 592-596.

820 Closek, C. J., S. Sunagawa, M. K. DeSalvo, Y. M. Piceno, T. Z. DeSantis et al., 2014

821 Coral transcriptome and bacterial community profiles reveal distinct Yellow Band

822 Disease states in Orbicella faveolata. The ISME journal 8: 2411-2422.

823 Conaco, C., P. Neveu, H. Zhou, M. L. Arcila, S. M. Degnan et al., 2012 Transcriptome

824 profiling of the demosponge Amphimedon queenslandica reveals genome-wide

825 events that accompany major life cycle transitions. BMC Genomics 13 (1): 209.

826 Concepcion, G., N. Polato, I. Baums, and R. Toonen, 2010 Development of microsatellite

827 markers from four Hawaiian corals: Acropora cytherea, Fungia scutaria,

828 Montipora capitata and Porites lobata. Conserv. Genet. Resour. 2 (1): 11-15.

829 Daly, M., D. G. Fautin, and V. A. Cappola, 2003 Systematics of the Hexacorallia

830 (Cnidaria: Anthozoa). Zool. J. Linn. Soc. 139 (3): 419-437.

831 Davies, S. W., M. Rahman, E. Meyer, E. A. Green, E. Buschiazzo et al., 2013 Novel

832 polymorphic microsatellite markers for population genetics of the endangered

833 Caribbean star coral, Montastraea faveolata. Mar. Biodivers. 43 (2): 167-172.

834 Davy, S. K., D. Allemand, and V. M. Weis, 2012 Cell biology of cnidarian-dinoflagellate

835 symbiosis. Microbiol. Mol. Biol. Rev. 76 (2): 229-261.

836 DeSalvo, M., C. Voolstra, S. Sunagawa, J. Schwarz, J. Stillman et al., 2008 Differential

837 gene expression during thermal stress and bleaching in the Caribbean coral

838 Montastraea faveolata. Mol. Ecol. 17 (17): 3952-3971.

839 Douglas, A. E., 2003 Coral bleaching - how and why? Mar. Pollut. Bull. 46 (4): 385-392.

840 Dunn, C. W., A. Hejnol, D. Q. Matus, K. Pang, W. E. Browne et al., 2008 Broad

841 phylogenomic sampling improves resolution of the animal tree of life. Nature 452

842 (7188): 745-749.

843 Fernandez-Silva, I., J. Whitney, B. Wainwright, K. R. Andrews, H. Ylitalo-Ward et al.,

844 2013 Microsatellites for next-generation ecologists: a post-sequencing

845 bioinformatics pipeline. PLoS One 8 (2): e55990.

846 Fuchs, B., W. Wang, S. Graspeuntner, Y. Li, S. Insua et al., 2014 Regulation of polyp-to-

847 jellyfish transition in Aurelia aurita. Curr. Biol. 24 (3): 263-273.

848 Fukami, H., C. A. Chen, A. F. Budd, A. Collins, C. Wallace et al., 2008 Mitochondrial

849 and nuclear genes suggest that stony corals are monophyletic but most families of

850 stony corals are not (Order Scleractinia, Class Anthozoa, Phylum Cnidaria). PLoS

851 One 3 (9): e3222.

852 Galtier, N., R. W. Jobson, B. Nabholz, S. Glémin, and P. U. Blier, 2009 Mitochondrial

853 whims: metabolic rate, longevity and the rate of molecular evolution. Biol. Lett. 5

854 (3): 413-416.

855 Grabherr, M. G., B. J. Haas, M. Yassour, J. Z. Levin, D. A. Thompson et al., 2011

856 Trinity: reconstructing a full-length transcriptome without a genome from RNA-

857 Seq data. Nat. Biotechnol. 29 (7): 644-652.

858 Graveley, B. R., A. N. Brooks, J. W. Carlson, M. O. Duff, J. M. Landolin et al., 2011 The

859 developmental transcriptome of Drosophila melanogaster. Nature 471 (7339):

860 473-479.

861 Haas, B. J., A. Papanicolaou, M. Yassour, M. Grabherr, P. D. Blood et al., 2013 De novo

862 transcript sequence reconstruction from RNA-seq using the Trinity platform for

863 reference generation and analysis. Nat. Protoc. 8 (8): 1494-1512.

864 Hamada, M., E. Shoguchi, C. Shinzato, T. Kawashima, D. J. Miller et al., 2012 The

865 complex NOD-like receptor repertoire of the coral Acropora digitifera includes

866 novel domain combinations. Mol. Biol. Evol.: mss213.

867 Havird, J. C., and S. R. Santos, 2014 Performance of single and concatenated sets of

868 mitochondrial genes at inferring metazoan relationships relative to full

869 mitogenome data. PLoS One 9 (1): e84080.

870 Helm, R. R., S. Siebert, S. Tulin, J. Smith, and C. W. Dunn, 2013 Characterization of

871 differential transcript abundance through time during Nematostella vectensis

872 development. BMC Genomics 14 (1): 266.

873 Huang, X., and A. Madan, 1999 CAP3: a DNA sequence assembly program. Genome

874 Res. 9 (9): 868-877.

875 Illumina, 2014 Illumina Customer Sequence Letter. Illumina, Inc., San Diego.

876 Karako-Lampert, S., D. Zoccola, M. Salmon-Divon, M. Katzenellenbogen, S. Tambutté

877 et al., 2014 Transcriptome analysis of the scleractinian coral Stylophora pistillata.

878 PLoS One 9 (2): e88615.

879 Katoh, K., K. Misawa, K. i. Kuma, and T. Miyata, 2002 MAFFT: a novel method for

880 rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids

881 Res. 30 (14): 3059-3066.

882 Kayal, E., B. Roure, H. Philippe, A. G. Collins, and D. V. Lavrov, 2013 Cnidarian

883 phylogenetic relationships as revealed by mitogenomics. BMC Evol. Biol. 13 (1):

884 5.

885 Kenkel, C., E. Meyer, and M. Matz, 2013 Gene expression under chronic heat stress in

886 populations of the mustard hill coral (Porites astreoides) from different thermal

887 environments. Mol. Ecol. 22 (16): 4322-4334.

888 Kininmonth, S., M. J. H. van Oppen, and H. P. Possingham, 2010 Determining the

889 community structure of the coral Seriatopora hystrix from hydrodynamic and

890 genetic networks. Ecol. Modell. 221 (24): 2870-2880.

891 Kitahara, M. V., S. D. Cairns, J. Stolarski, D. Blair, and D. J. Miller, 2010 A

892 comprehensive phylogenetic analysis of the Scleractinia (Cnidaria, Anthozoa)

893 based on mitochondrial CO1 sequence data. PLoS One 5 (7): e11490.

894 Kocot, K. M., J. T. Cannon, C. Todt, M. R. Citarella, A. B. Kohn et al., 2011

895 Phylogenomics reveals deep molluscan relationships. Nature 477 (7365): 452-

896 456.

897 Kocot, K. M., M. R. Citarella, L. L. Moroz, and K. M. Halanych, 2013 PhyloTreePruner:

898 a phylogenetic tree-based approach for selection of orthologous sequences for

899 phylogenomics. Evol. Bioinform. Online 9: 429-435.

900 Kück, P., and K. Meusemann, 2010 FASconCAT: Convenient handling of data matrices.

901 Mol. Phylogenet. Evol. 56 (3): 1115-1118.

902 Lehnert, E. M., M. S. Burriesci, and J. R. Pringle, 2012 Developing the anemone Aiptasia

903 as a tractable model for cnidarian-dinoflagellate symbiosis: the transcriptome of

904 aposymbiotic A. pallida. BMC Genomics 13 (1): 271.

905 Lehnert, E. M., M. E. Mouchka, M. S. Burriesci, N. D. Gallo, J. A. Schwarz et al., 2014

906 Extensive differences in gene expression between symbiotic and aposymbiotic

907 Cnidarians. G3: Genes|Genomes|Genetics 4 (2): 277-295.

908 Li, L., C. J. Stoeckert, and D. S. Roos, 2003 OrthoMCL: Identification of ortholog groups

909 for eukaryotic genomes. Genome Res. 13 (9): 2178-2189.

910 Maddison, W. P., and D. R. Maddison, 2011 Mesquite: a modular system for

911 evolutionary analysis.

912 Maier, E., R. Tollrian, and B. Nürnberger, 2001 Development of species-specific markers

913 in an organism with endosymbionts: microsatellites in the scleractinian coral

914 Seriatopora hystrix. Mol. Ecol. Notes 1 (3): 157-159.

915 Marlow, H. Q., M. Srivastava, D. Q. Matus, D. Rokhsar, and M. Q. Martindale, 2009

916 Anatomy and development of the nervous system of Nematostella vectensis, an

917 anthozoan cnidarian. Dev. Neurobiol. 69 (4): 235-254.

918 Mazel, C. H., M. P. Lesser, M. Y. Gorbunov, T. M. Barry, J. H. Farrell et al., 2003

919 Green-fluorescent proteins in Caribbean corals. Limnol. Oceanogr. 48 (1): 402-

920 411.

921 Meyer, E., Meyer Laboratory Website. http://people.oregonstate.edu/~meyere/index.html

922 Meyer, E., G. V. Aglyamova, and M. V. Matz, 2011 Profiling gene expression responses

923 of coral larvae (Acropora millepora) to elevated temperature and settlement

924 inducers using a novel RNA-Seq procedure. Mol. Ecol. 20 (17): 3599-3616.

925 Meyer, E., G. V. Aglyamova, S. Wang, J. Buchanan-Carter, D. Abrego et al., 2009

926 Sequencing and de novo analysis of a coral larval transcriptome using 454 GSFlx.

927 BMC Genomics 10 (1): 219.

928 Meyer, E., and V. M. Weis, 2012 Study of cnidarian-algal symbiosis in the “omics” age.

929 Biol. Bull. 223 (1): 44-65.

930 Moriya, Y., M. Itoh, S. Okuda, A. C. Yoshizawa, and M. Kanehisa, 2007 KAAS: an

931 automatic genome annotation and pathway reconstruction server. Nucleic Acids

932 Res. 35 (suppl 2): W182-W185.

933 Moya, A., L. Huisman, E. Ball, D. Hayward, L. Grasso et al., 2012 Whole transcriptome

934 analysis of the coral Acropora millepora reveals complex responses to

935 CO2‐driven acidification during the initiation of calcification. Mol. Ecol. 21 (10):

936 2440-2454.

937 Nakasugi, K., R. N. Crowhurst, J. Bally, C. C. Wood, R. P. Hellens et al., 2013 De Novo

938 transcriptome sequence assembly and analysis of RNA silencing genes of

939 Nicotiana benthamiana. PLoS One 8 (3): e59534.

940 O'Neil, S., J. Dzurisin, R. Carmichael, N. Lobo, S. Emrich et al., 2010 Population-level

941 transcriptome sequencing of nonmodel organisms Erynnis propertius and Papilio

942 zelicaon. BMC Genomics 11 (1): 310.

943 O'Neil, S., and S. Emrich, 2013 Assessing De Novo transcriptome assembly metrics for

944 consistency and utility. BMC Genomics 14 (1): 465.

945 Parra, G., K. Bradnam, and I. Korf, 2007 CEGMA: a pipeline to accurately annotate core

946 genes in eukaryotic genomes. Bioinformatics 23 (9): 1061-1067.

947 Philippe, H., E. A. Snell, E. Bapteste, P. Lopez, P. W. Holland et al., 2004

948 Phylogenomics of eukaryotes: impact of missing data on large alignments. Mol.

949 Biol. Evol. 21 (9): 1740-1752.

950 Polato, N. R., J. C. Vera, and I. B. Baums, 2011 Gene discovery in the threatened elkhorn

951 coral: 454 sequencing of the Acropora palmata transcriptome. PLoS One 6 (12):

952 e28634.

953 Poole, A. Z., and V. M. Weis, 2014 TIR-domain-containing protein repertoire of nine

954 anthozoan species reveals coral–specific expansions and uncharacterized proteins.

955 Dev. Comp. Immunol. 46 (2): 480-488.

956 Pratlong, M., A. Haguenauer, O. Chabrol, C. Klopp, P. Pontarotti et al., 2015 The red

957 coral (Corallium rubrum) transcriptome: a new resource for population genetics

958 and local adaptation studies. Mol. Ecol. Resour. 15: 1205-1215.

959 Price, M. N., P. S. Dehal, and A. P. Arkin, 2010 FastTree 2 – approximately Maximum-

960 Likelihood trees for large alignments. PLoS One 5 (3): e9490.

961 Putnam, N. H., M. Srivastava, U. Hellsten, B. Dirks, J. Chapman et al., 2007 Sea

962 anemone genome reveals ancestral eumetazoan gene repertoire and genomic

963 organization. Science 317 (5834): 86-94.

964 Quast, C., E. Pruesse, P. Yilmaz, J. Gerken, T. Schweer et al., 2012 The SILVA

965 ribosomal RNA gene database project: improved data processing and web-based

966 tools. Nucleic Acids Res.: gks1219.

967 Reynolds, W. S., J. A. Schwarz, and V. M. Weis, 2000 Symbiosis-enhanced gene

968 expression in cnidarian-algal associations: cloning and characterization of a

969 cDNA, sym32, encoding a possible cell adhesion protein. Comp. Biochem. Physiol. A 126

970 (1): 33-44.

971 Riesgo, A., S. C. Andrade, P. Sharma, M. Novo, A. Perez-Porro et al., 2012 Comparative

972 description of ten transcriptomes of newly sequenced invertebrates and efficiency

973 estimation of genomic sampling in non-model taxa. Front. Zool. 9 (1): 33.

974 Riesgo, A., N. Farrar, P. J. Windsor, G. Giribet, and S. P. Leys, 2014 The analysis of

975 eight transcriptomes from all poriferan classes reveals surprising genetic

976 complexity in sponges. Mol. Biol. Evol. 31 (5): 1102-1120.

977 Romano, S. L., and S. R. Palumbi, 1996 Evolution of scleractinian corals inferred from

978 molecular systematics. Science 271 (5249): 640-642.

979 Rota-Stabelli, O., Z. Yang, and M. J. Telford, 2009 MtZoa: A general mitochondrial

980 amino acid substitutions model for animal evolutionary studies. Mol. Phylogenet.

981 Evol. 52 (1): 268-272.

982 Roure, B., D. Baurain, and H. Philippe, 2013 Impact of missing data on phylogenies

983 inferred from empirical phylogenomic data sets. Mol. Biol. Evol. 30 (1): 197-214.

984 Rozen, S., and H. Skaletsky, 1999 Primer3 on the WWW for general users and for

985 biologist programmers, pp. 365-386 in Bioinformatics methods and protocols.

986 Springer.

987 Ruiz-Ramos, D., and I. Baums, 2014 Microsatellite abundance across the Anthozoa and

988 Hydrozoa in the phylum Cnidaria. BMC Genomics 15 (1): 939.

989 Ryan, J. F., P. M. Burton, M. E. Mazza, G. K. Kwong, J. C. Mullikin et al., 2006 The

990 cnidarian-bilaterian ancestor possessed at least 56 homeoboxes: evidence from the

991 starlet sea anemone, Nematostella vectensis. Genome Biol. 7 (7): R64.

992 Ryan, J. F., K. Pang, C. E. Schnitzler, A.-D. Nguyen, R. T. Moreland et al., 2013 The

993 genome of the ctenophore Mnemiopsis leidyi and its implications for cell type

994 evolution. Science 342 (6164).

995 Sanders, J. G., and S. R. Palumbi, 2011 Populations of Symbiodinium muscatinei show

996 strong biogeographic structuring in the intertidal anemone Anthopleura

997 elegantissima. Biol. Bull. 220 (3): 199-208.

998 Sanders, S., M. Shcheglovitova, and P. Cartwright, 2014 Differential gene expression

999 between functionally specialized polyps of the colonial hydrozoan Hydractinia

1000 symbiolongicarpus (Phylum Cnidaria). BMC Genomics 15 (1): 406.

1001 Schnitzler, C. E., and V. M. Weis, 2010 Coral larvae exhibit few measurable

1002 transcriptional changes during the onset of coral-dinoflagellate endosymbiosis.

1003 Mar. Genomics 3 (2): 107-116.

1004 Selkoe, K. A., and R. J. Toonen, 2006 Microsatellites for ecologists: a practical guide to

1005 using and evaluating microsatellite markers. Ecol. Lett. 9 (5): 615-629.

1006 Serrano, X., I. B. Baums, K. O'Reilly, T. B. Smith, R. J. Jones et al., 2014 Geographic

1007 differences in vertical connectivity in the Caribbean coral Montastraea cavernosa

1008 despite high levels of horizontal connectivity at shallow depths. Mol. Ecol. 23

1009 (17): 4226-4240.

1010 Shearer, T. L., C. Gutiérrez-Rodríguez, and M. A. Coffroth, 2005 Generating molecular

1011 markers from zooxanthellate cnidarians. Coral Reefs 24 (1): 57-66.

1012 Shinzato, C., M. Hamada, E. Shoguchi, T. Kawashima, and N. Satoh, 2012a The

1013 repertoire of chemical defense genes in the coral Acropora digitifera genome.

1014 Zoolog. Sci. 29 (8): 510-517.

1015 Shinzato, C., M. Inoue, and M. Kusakabe, 2014 A snapshot of a coral “holobiont”: a

1016 transcriptome assembly of the scleractinian coral, Porites, captures a wide variety

1017 of genes from both the host and symbiotic zooxanthellae. PLoS One 9 (1):

1018 e85182.

1019 Shinzato, C., E. Shoguchi, T. Kawashima, M. Hamada, K. Hisata et al., 2011 Using the

1020 Acropora digitifera genome to understand coral responses to environmental

1021 change. Nature 476 (7360): 320-323.

1022 Shinzato, C., E. Shoguchi, M. Tanaka, and N. Satoh, 2012b Fluorescent protein candidate

1023 genes in the coral Acropora digitifera genome. Zoolog. Sci. 29 (4): 260-264.

1024 Siebert, S., M. D. Robinson, S. C. Tintori, F. Goetz, R. R. Helm et al., 2011 Differential

1025 gene expression in the siphonophore Nanomia bijuga (Cnidaria) assessed with

1026 multiple next-generation sequencing workflows. PLoS One 6 (7): e22953.

1027 Smit, A., R. Hubley, and P. Green, RepeatMasker Open-3.0.

1028 http://www.repeatmasker.org

1029 Smith-Keune, C., and S. Dove, 2008 Gene expression of a green fluorescent protein

1030 homolog as a host-specific biomarker of heat stress within a reef-building coral.

1031 Mar. Biotechnol. (N. Y.) 10 (2): 166-180.

1032 Soza‐Ried, J., A. Hotz‐Wagenblatt, K. H. Glatting, C. del Val, K. Fellenberg et al., 2010

1033 The transcriptome of the colonial marine hydroid Hydractinia echinata. FEBS

1034 J. 277 (1): 197-209.

1035 Srivastava, M., O. Simakov, J. Chapman, B. Fahey, M. E. A. Gauthier et al., 2010 The

1036 Amphimedon queenslandica genome and the evolution of animal complexity.

1037 Nature 466 (7307): 720-726.

1038 Stamatakis, A., 2014 RAxML Version 8: A tool for phylogenetic analysis and post-

1039 analysis of large phylogenies. Bioinformatics 30 (9): 1312-1313.

1040 Stefanik, D. J., T. J. Lubinski, B. R. Granger, A. L. Byrd, A. M. Reitzel et al., 2014

1041 Production of a reference transcriptome and transcriptomic database

1042 (EdwardsiellaBase) for the lined sea anemone, Edwardsiella lineata, a parasitic

1043 cnidarian. BMC Genomics 15 (1): 71.

1044 Sun, J., Q. Chen, J. C. Lun, J. Xu, and J.-W. Qiu, 2013 PcarnBase: Development of a

1045 transcriptomic database for the brain coral Platygyra carnosus. Mar. Biotechnol.

1046 (N. Y.) 15 (2): 244-251.

1047 The Gene Ontology Consortium, M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein et al., 2000

1048 Gene Ontology: tool for the unification of biology. Nat. Genet. 25 (1): 25-29.

1049 Traylor-Knowles, N., B. R. Granger, T. J. Lubinski, J. R. Parikh, S. Garamszegi et al.,

1050 2011 Production of a reference transcriptome and transcriptomic database

1051 (PocilloporaBase) for the cauliflower coral, Pocillopora damicornis. BMC

1052 Genomics 12 (1): 585.

1053 Underwood, J. N., L. D. Smith, M. J. H. Van Oppen, and J. P. Gilmour, 2007 Multiple

1054 scales of genetic connectivity in a brooding coral on isolated reefs following

1055 catastrophic bleaching. Mol. Ecol. 16 (4): 771-784.

1056 Underwood, J. N., P. B. Souter, E. R. Ballment, A. H. Lutz, and M. J. H. Van Oppen,

1057 2006 Development of 10 polymorphic microsatellite markers from herbicide-

1058 bleached tissues of the brooding pocilloporid coral Seriatopora hystrix. Mol. Ecol.

1059 Notes 6 (1): 176-178.

1060 Van Dongen, S., 2000 Graph clustering by flow simulation. University of Utrecht, The

1061 Netherlands.

1062 van Oppen, M. J. H., J. Catmull, B. J. McDonald, N. R. Hislop, P. J. Hagerman et al.,

1063 2002 The mitochondrial genome of Acropora tenuis (Cnidaria; Scleractinia)

1064 contains a large group I intron and a candidate control region. J. Mol. Evol. 55

1065 (1): 1-13.

1066 van Oppen, M. J. H., A. Lutz, G. De'ath, L. Peplow, and S. Kininmonth, 2008 Genetic

1067 traces of recent long-distance dispersal in a predominantly self-recruiting coral.

1068 PLoS One 3 (10): e3401.

1069 Vidal-Dupiol, J., D. Zoccola, E. Tambutté, C. Grunau, C. Cosseau et al., 2013 Genes

1070 Related to Ion-Transport and Energy Production Are Upregulated in Response to

1071 CO2-Driven pH Decrease in Corals: New Insights from Transcriptome Analysis.

1072 PLoS One 8 (3): e58652.

1073 Weis, V. M., 2008 Cellular mechanisms of Cnidarian bleaching: stress causes the

1074 collapse of symbiosis. J. Exp. Biol. 211 (19): 3059-3066.

1075 Weis, V. M., and D. Allemand, 2009 What determines coral health? Science 324 (5931):

1076 1153-1155.

1077 Wenger, Y., and B. Galliot, 2013 RNAseq versus genome-predicted transcriptomes: a

1078 large population of novel transcripts identified in an Illumina-454 Hydra

1079 transcriptome. BMC Genomics 14 (1): 204.

1080 Whelan, S., and N. Goldman, 2001 A general empirical model of protein evolution

1081 derived from multiple protein families using a Maximum-Likelihood approach.

1082 Mol. Biol. Evol. 18 (5): 691-699.

1083 Wood-Charlson, E. M., and V. M. Weis, 2009 The diversity of C-type lectins in the

1084 genome of a basal metazoan, Nematostella vectensis. Dev. Comp. Immunol. 33

1085 (8): 881-889.

1086

1087 FIGURE LEGENDS

1088 Figure 1. Annotation pipeline used to classify origins of each assembled transcript.

1089 A series of sequence comparisons was performed, comparing each transcript against N.

1090 vectensis rRNA [SILVA: ABAV01023297, ABAV01023333], A. tenuis

1091 mitochondrial DNA [NCBI: NC_003522.1], A. digitifera and S. minutum gene models,

1092 and the NCBI non-redundant protein database (bit-score threshold of 45 for small

1093 databases; e-value threshold of 10⁻⁵ for large databases). Transcripts were assigned to

1094 categories by evaluating their similarity to each database in the order shown (see

1095 Methods for details).
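
The ordered screen described in this legend can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' pipeline. The two thresholds come from the legend and the category labels follow Figure 4. The function name, the database keys ("rRNA", "mtDNA", "coral", "symbiont"), and the rule that resolves coral against symbiont hits by the higher bit-score are all hypothetical.

```python
from typing import Mapping, Optional

BITSCORE_MIN = 45.0  # threshold for the small local databases (from the legend)
EVALUE_MAX = 1e-5    # threshold for the large NCBI nr database (from the legend)


def classify_transcript(
    small_db_bitscores: Mapping[str, float],
    nr_evalue: Optional[float] = None,
    nr_taxon: Optional[str] = None,
) -> str:
    """Assign one transcript to a category by checking databases in order.

    small_db_bitscores maps a hypothetical database key to the best
    bit-score against that database; a missing key means no hit.
    """
    # Steps 1 and 2: rRNA and mtDNA are screened first.
    for label in ("rRNA", "mtDNA"):
        if small_db_bitscores.get(label, 0.0) >= BITSCORE_MIN:
            return label

    # Step 3: coral versus symbiont gene models. Taking the higher
    # bit-score is an assumption made for this sketch.
    coral = small_db_bitscores.get("coral", 0.0)
    symbiont = small_db_bitscores.get("symbiont", 0.0)
    if max(coral, symbiont) >= BITSCORE_MIN:
        return "metazoan" if coral >= symbiont else "dinoflagellate"

    # Step 4: fall back to the nr search and its taxonomic label.
    if nr_evalue is not None and nr_evalue <= EVALUE_MAX:
        if nr_taxon in ("metazoan", "dinoflagellate"):
            return nr_taxon
        return "other taxa"

    return "no match"
```

Because each check returns as soon as it succeeds, a transcript that matches rRNA is never tested against the later databases, which is the sense in which the order of comparisons matters.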

1096

1097 Figure 2. Three metrics used to evaluate gene representation and assembly of

1098 complete transcripts in de novo transcriptome assemblies. (a) Percent of core

1099 eukaryotic genes (CEGMA) identified in each assembly; (b) percent of A. digitifera gene

1100 models with significant matches in each assembly; (c) median proportion of each N.

1101 vectensis protein aligned with transcripts in each assembly (OHRhits). Grey = our

1102 transcriptome assembly compared to the respective reference for each analysis.
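
The metric in panel (c) can be read as a short calculation. The sketch below is an illustration only. It assumes that the aligned length of the best-matching transcript is already known for every N. vectensis protein with a hit, and that proteins without a hit are excluded. The function and variable names are hypothetical and the toy values are invented.

```python
from statistics import median
from typing import Mapping


def median_hit_ratio(protein_lengths: Mapping[str, int],
                     aligned_lengths: Mapping[str, int]) -> float:
    """Median fraction of each reference protein covered by its best hit.

    Only proteins with a hit (the keys of aligned_lengths) are counted;
    that choice is an assumption of this sketch.
    """
    ratios = []
    for protein_id, aligned in aligned_lengths.items():
        full_length = protein_lengths[protein_id]
        # Cap at 1.0 so a gapped alignment cannot exceed full coverage.
        ratios.append(min(aligned / full_length, 1.0))
    return median(ratios)


# Toy lengths in amino acids, invented for illustration.
lengths = {"prot_a": 100, "prot_b": 200, "prot_c": 300}
aligned = {"prot_a": 50, "prot_b": 200, "prot_c": 150}
example_ratio = median_hit_ratio(lengths, aligned)  # ratios: 0.5, 1.0, 0.5
```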

1103

1104 Figure 3. Distribution of functional categories (GO terms) in each transcriptome

1105 assembly. The percentage of transcripts with GO annotation for each category under the

1106 three main ontology domains was calculated for each assembly.

1107

1108 Figure 4. Predicted taxonomic origin of transcriptomes based on homology searches

1109 with BLAST. The percentages of transcripts assigned to rRNA (purple), mtDNA

1110 (blue), dinoflagellate (green), metazoan (pink), other taxa (orange) and no match (grey)

1111 are shown.

1112

1113 Figure 5. Discordance in maximum likelihood phylogenetic reconstruction of COI

1114 compared to a combined phylogeny of concatenated ND (2, 4 and 5) genes and two

1115 phylogenomic trees. The COI phylogeny is presented on the left and the combined

1116 phylogeny is presented on the right. Topologies for the ND mitochondrial set and the relaxed and

1117 conservative phylogenomic trees were nearly identical; therefore, nodal support is

1118 summarized on the relaxed tree (right). Bootstrap support at the nodes from left to right

1119 represents ND gene set/relaxed/conservative. If topologies differed in the summary tree,

1120 the nodal support is presented as -- next to the node. Yellow solid lines connect taxa

1121 with different positions and/or relationships between the two trees, while black dashed

1122 lines connect those with the same position and/or relationship. Groups in the class

1123 Anthozoa, as reconstructed by Kitahara et al. (2010), are highlighted in

1124 boxes: teal = robust corals, pink = complex corals, and light blue = anemones. The names

1125 of species used in this study are emphasized by bold font. Scale bars indicate the amino

1126 acid replacements per site.

1127

1128 Figure S1 in File S1.doc. Venn diagram of shared orthologous groups. Comparison

1129 of the orthologous groups identified with FastOrtho from the four transcriptomes in this

1130 study. Total orthologous groups for each transcriptome are given in parentheses under

1131 the species name. S. hystrix and M. cavernosa shared the most orthologs (3,900) followed

1132 by F. scutaria and M. cavernosa (1,682).

1133

1134 Figure S2 in File S1.doc. Individual maximum likelihood trees from COI,

1135 concatenated ND genes, relaxed and conservative taxon sampling across the whole

1136 transcriptomes and genomes. The optimal COI (A), ND genes (B), relaxed (C) and

1137 conservative (D) phylogenies are presented with nodal support from 500 bootstrap

1138 replicates, except for the relaxed tree, which used 100 bootstrap replicates. The four transcriptomes

1139 from this study are highlighted by bold font. The scale bar beneath each tree indicates the

1140 amino acid substitutions per site.

1141

1142 TABLES

1143 Table 1. Collection sites, life history stages and symbiotic states of the four anthozoans

1144 used for transcriptome assembly.

Organism                    Collection Site      Developmental Stage   Symbiotic State
Anthopleura elegantissima   Seal Rock, OR        Adult                 Aposymbiotic
Fungia scutaria             Coconut Island, HI   Larval                Aposymbiotic
Montastraea cavernosa       Florida Keys, FL     Adult                 Symbiotic
Seriatopora hystrix         Nanwan Bay, Taiwan   Adult                 Symbiotic

1145

1146 Table S1 in File S1.doc. Oligonucleotide primers used in sample preparation for

1147 Illumina sequencing.

1148

1149 Table S2 in File S1.doc. Genomic and transcriptomic datasets used for ortholog

1150 identification and phylogenetic analyses.

1151

1152 Table S3 in File S1.doc. Cytochrome oxidase subunit I (COI) sequences used in the

1153 phylogenetic analysis.

1154

1155 Table S4 in File S1.doc. Supergene set of NADH dehydrogenase transcripts used in the

1156 phylogenetic analysis.

1157

1158 Table S5 in File S1.doc. General transcriptome assembly and annotation statistics before

1159 and after a minimum transcript length was set to 400 bp.

1160

1161 Table S6 in File S2.xls. Compiled annotation for A. elegantissima transcriptome

1162 including transcript ID, UniProt, GO and KEGG annotation, and ribosomal RNA,

1163 mitochondrial DNA or taxa origin from local and NCBI database searches.

1164

1165 Table S7 in File S3.xls. Compiled annotation for F. scutaria transcriptome including

1166 transcript ID, UniProt, GO and KEGG annotation, and ribosomal RNA, mitochondrial

1167 DNA or taxa origin from local and NCBI database searches.

1168

1169 Table S8 in File S4.xls. Compiled annotation for M. cavernosa transcriptome including

1170 transcript ID, UniProt, GO and KEGG annotation, and ribosomal RNA, mitochondrial

1171 DNA or taxa origin from local and NCBI database searches.

1172

1173 Table S9 in File S5.xls. Compiled annotation for S. hystrix transcriptome including

1174 transcript ID, UniProt, GO and KEGG annotation, and ribosomal RNA, mitochondrial

1175 DNA or taxa origin from local and NCBI database searches.

1176

1177 Table S10 in File S1.doc. Comparison of gene searches by name search and reciprocal

1178 BLAST. Bit-score cutoffs were set to 45 and taxonomic annotations were designated

1179 based on our taxonomic screen (Figure 1).

1180

1181 Table S11 in File S6.xls. Primers designed for potential SSR markers from each species

1182 in this study.

1183

1184 Table S12 in File S7.xls. Orthologs used in relaxed (≥ 10 taxa) and conservative (≥ 14

1185 taxa) phylogenomic analyses.
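
The two taxon-occupancy thresholds in this caption amount to a simple filter over ortholog groups. The following minimal sketch assumes each group is available as a mapping from a group ID to the set of taxa in which it was found. The function name, group IDs, and taxon labels are hypothetical; only the thresholds (10 and 14 taxa) come from the caption.

```python
from typing import Dict, List, Set


def filter_by_occupancy(groups: Dict[str, Set[str]], min_taxa: int) -> List[str]:
    """Return IDs of ortholog groups represented in at least min_taxa taxa."""
    return sorted(gid for gid, taxa in groups.items() if len(taxa) >= min_taxa)


# Toy input: hypothetical group IDs mapped to the taxa in which they occur.
toy_groups = {
    "OG1": {f"taxon{i}" for i in range(15)},  # present in 15 taxa
    "OG2": {f"taxon{i}" for i in range(11)},  # present in 11 taxa
    "OG3": {f"taxon{i}" for i in range(4)},   # present in 4 taxa
}

relaxed = filter_by_occupancy(toy_groups, min_taxa=10)       # threshold from Table S12
conservative = filter_by_occupancy(toy_groups, min_taxa=14)  # threshold from Table S12
```

With these toy groups the relaxed set keeps OG1 and OG2 while the conservative set keeps only OG1, showing the trade-off between the number of genes retained and the amount of missing data per gene.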

1186

1187 Figure 1.

1188

1189 Figure 2.

1190

1191

1192 Figure 3.

1193

1194

1195 Figure 4.

1196

1197

1198 Figure 5.

1199
