G3: Genes|Genomes|Genetics Early Online, published on September 17, 2015 as doi:10.1534/g3.115.020164
1 De novo assembly and characterization of four anthozoan (phylum
2 Cnidaria) transcriptomes
3
4 Sheila A. Kitchen * ^
5 Email: [email protected]
6 Camerron M. Crowder ^
7 Email: [email protected]
8 Angela Z. Poole
9 Email: [email protected]
10 Virginia M. Weis
11 Email: [email protected]
12 Eli Meyer
13 Email: [email protected]
14
15 Department of Integrative Biology, Oregon State University, 3029 Cordley Hall,
16 Corvallis, OR 97331, USA
17
18 Equal Contributors
19
20 Accession Numbers:
21 Raw data: NCBI’s SRA, accession SRP063463
22 Assemblies: DRYAD digital repository, doi:10.5061/dryad.3f08f
23
© The Author(s) 2013. Published by the Genetics Society of America.
24 Running Title: Four Anthozoan Transcriptomes
25
26 Keywords: coral, phylogenomics, non-model system, database
27
28 Corresponding Author:
29 Sheila Kitchen
30 Department of Integrative Biology
31 3029 Cordley Hall
32 Corvallis, OR 97330
33 USA
34
35 Phone: 703-673-6292
36 Email: [email protected]
37
38
39
40
41
42
43
44
45 ABSTRACT
46 Many non-model species exemplify important biological questions but lack the sequence
47 resources required to study the genes and genomic regions underlying traits of interest.
48 Reef-building corals are famously sensitive to rising seawater temperatures, motivating
49 ongoing research into their stress responses and long-term prospects in a changing
50 climate. A comprehensive understanding of these processes will require extending
51 beyond the sequenced coral genome (Acropora digitifera) to encompass diverse coral
52 species and related anthozoans. Toward that end, we have assembled and annotated
53 reference transcriptomes to develop catalogs of gene sequences for three scleractinian
54 corals (Fungia scutaria, Montastraea cavernosa, Seriatopora hystrix) and a temperate
55 anemone (Anthopleura elegantissima). High-throughput sequencing of cDNA libraries
56 produced ~20-30 million reads per sample, and de novo assembly of these reads produced
57 ~75-110 thousand transcripts from each sample with size distributions (mean ~1.4 kb,
58 N50 ~2 kb) comparable to the distribution of gene models from the coral genome (mean
59 ~1.7 kb, N50 ~2.2 kb). Each assembly includes matches for more than half the gene
60 models from A. digitifera (54-67%), and many reasonably complete transcripts (~5,300-
61 6,700) spanning nearly the entire gene (ortholog hit ratios ≥ 0.75). The catalogs of gene
62 sequences developed in this study made it possible to identify hundreds to thousands of
63 orthologs across diverse scleractinian species and related taxa. We used these sequences
64 for phylogenetic inference, recovering known relationships and demonstrating superior
65 performance over phylogenetic trees constructed using single mitochondrial loci. The
66 resources developed in this study provide gene sequences and genetic markers for several
67 anthozoan species. To enhance the utility of these resources for the research community,
68 we developed searchable databases enabling researchers to rapidly recover sequences for
69 genes of interest. Our analysis of de novo assembly quality highlights metrics that we
70 expect will be useful for evaluating the relative quality of other de novo transcriptome
71 assemblies. The identification of orthologous sequences and phylogenetic reconstruction
72 demonstrate the feasibility of these methods for clarifying the substantial uncertainties in
73 the existing scleractinian phylogeny.
74
75 INTRODUCTION
76 Transcriptome sequencing provides a rapid and cost-effective approach for gene
77 discovery in non-model organisms. Analysis of transcriptomes from a diverse range of
78 invertebrates such as sponges (Riesgo et al. 2014; Conaco et al. 2012), ctenophores (Ryan
79 et al. 2013), annelids (Riesgo et al. 2012), and molluscs (Riesgo et al. 2012; Kocot et al.
80 2011) has enhanced comparative and evolutionary studies of metazoans. Quantitative
81 analysis of these sequences (RNA-Seq) has become the method of choice to profile
82 genome-wide transcription levels. This technique provides an unbiased approach to
83 discovering functional processes through identification and quantification of
84 differentially expressed genes between phenotypic states including experimental
85 treatments (Meyer et al. 2011), tissue types (Siebert et al. 2011), and developmental
86 stages (Graveley et al. 2011).
87
88 Genomic and transcriptomic resources have been developed for a variety of species
89 within the phylum Cnidaria (Moya et al. 2012; Barshis et al. 2013; Fuchs et al. 2014;
90 Helm et al. 2013; Lehnert et al. 2012; Meyer et al. 2011; Meyer et al. 2009; Polato et al.
91 2011; Shinzato et al. 2014; Soza‐Ried et al. 2010; Traylor-Knowles et al. 2011; Wenger
92 and Galliot 2013; Sun et al. 2013; Meyer and Weis 2012; Lehnert et al. 2014), a diverse
93 group of evolutionarily and ecologically significant species that range from hydroids
94 (Class Hydrozoa) and jellyfish (Class Scyphozoa) to sea anemones and corals (Class
95 Anthozoa). Cnidarians are among the early-diverging or basal metazoans and occupy a key
96 position as a sister taxon to the bilaterians (Dunn et al. 2008). Many cnidarians play an
97 important role in marine trophic cascades, due to their mutualistic relationship with
98 dinoflagellate species of the genus Symbiodinium that reside inside of cnidarian host
99 cells. This relationship is based on nutritional exchange in which Symbiodinium spp.
100 provide the cnidarian host with products from photosynthesis in return for inorganic
101 nutrients and a stable, high-light environment (Davy et al. 2012). The paramount
102 examples of this partnership are the reef-building corals, which form the trophic and
103 structural foundation of productive and biodiverse coral reef ecosystems. Anthropogenic
104 stressors, especially those associated with global climate change, are gravely threatening
105 these reef ecosystems, including the corals themselves (Douglas 2003; Weis and
106 Allemand 2009). Insight into the molecular mechanisms that underlie coral-dinoflagellate
107 symbioses and their stress response to environmental perturbation is critical for future
108 management and conservation of coral reef ecosystems.
109
110 To date, there are two publicly available, sequenced genomes from the Anthozoa: the
111 symbiotic coral Acropora digitifera (Shinzato et al. 2011) and the non-symbiotic sea
112 anemone, Nematostella vectensis (Putnam et al. 2007). These genomes have provided
113 insight into the genomic complexity of cnidarians, furthering studies of gene evolution
114 and function across basal metazoans (Poole and Weis 2014; Putnam et al. 2007; Shinzato
115 et al. 2011; Marlow et al. 2009; Ryan et al. 2006; Hamada et al. 2012; Shinzato et al.
116 2012a; Shinzato et al. 2012b; Wood-Charlson and Weis 2009; Dunn et al. 2008).
117 Comparison of these genomes has revealed putative symbiosis-associated genes that may
118 function in the onset and maintenance of cnidarian-dinoflagellate symbiosis (Meyer and
119 Weis 2012). Annotated de novo transcriptomes, generated using NGS (expressed
120 sequence tags (ESTs), 454 pyrosequencing and Illumina HiSeq technologies), have been
121 published for 8 genera of anthozoans (Polato et al. 2011; Kenkel et al. 2013; Meyer et al.
122 2009; Traylor-Knowles et al. 2011; Lehnert et al. 2012; Pratlong et al. 2015; Shinzato et
123 al. 2014; Vidal-Dupiol et al. 2013). These resources have been used in a variety of
124 contexts including the study of gene family evolution (Poole and Weis 2014), symbiosis-
125 enhanced gene expression (Lehnert et al. 2014) and responses to environmental stressors
126 such as elevated seawater temperature (Meyer et al. 2011; Kenkel et al. 2013), bacterial
127 infection (Closek et al. 2014), and CO2-driven changes in seawater pH (Vidal-Dupiol et
128 al. 2013). These studies are adding to earlier generation omics studies (EST studies,
129 subtractive hybridization and cDNA microarrays (Meyer and Weis 2012)) and are
130 providing information on the mechanisms of cnidarian-dinoflagellate symbiosis and coral
131 bleaching, a stress response that results from the breakdown of the partnership (Weis
132 2008; Davy et al. 2012). Expression studies are therefore contributing not only to our
133 basic understanding of cellular processes in cnidarians, but also to our ability to link
134 molecular responses with phenotypic change due to environmental perturbation.
135
136 The available anthozoan resources are limited in taxonomic diversity, and dominated by a
137 few genera from a narrow geographic range (Meyer and Weis 2012). In addition, many
138 resources are from aposymbiotic (lacking dinoflagellate symbionts) samples or non-
139 symbiotic species, which limits the study of interplay between the two partners. One goal
140 of this work is to increase the number and diversity of anthozoan resources for
141 comparative, phylogenetic and functional analyses.
142
143 In this study, we present transcriptomes from four anthozoans: the sea anemone
144 Anthopleura elegantissima (Brandt, 1835) and the corals Fungia scutaria (Lamarck,
145 1801), Montastraea cavernosa (Linnaeus, 1767), and Seriatopora hystrix (Dana, 1846) in
146 varying symbiotic states, life history stages and geographic locations (Table 1). These
147 species are of particular interest to investigations into the molecular mechanisms
148 associated with the onset, maintenance and breakdown of cnidarian-dinoflagellate
149 symbioses. We highlight how these transcriptomes can be used in applications ranging
150 from targeted gene searches to orthologous group predictions and phylogenomic analysis.
151 In addition, we outline a method to screen for cross-contamination between sequencing
152 libraries that can be broadly applied to other transcriptome studies.
153
154 MATERIALS AND METHODS
155 Sample collection and RNA extraction
156 All four anthozoan species examined in this study engage in symbiosis with
157 Symbiodinium spp., and therefore RNA extractions typically include contributions from
158 the dinoflagellate symbionts at some level. Here, two samples (M. cavernosa and S.
159 hystrix) were collected from symbiotic specimens and two samples (F. scutaria and A.
160 elegantissima) were collected from nominally aposymbiotic stages or specimens (Table
161 1). Larvae of F. scutaria were reared in filtered seawater at Hawaii Institute of Marine
162 Biology following fertilization and development, and remained symbiont-free during
163 development (Schnitzler and Weis 2010). The aposymbiotic specimen of A.
164 elegantissima was collected in that condition in the field.
165
166 Total RNA was extracted from S. hystrix, F. scutaria, and A. elegantissima using the
167 following methods. S. hystrix tissue was stored in RNAlater® Stabilization Solution
168 (Qiagen, CA, US) and RNA was extracted using the RNeasy Mini Kit (Qiagen, CA, US)
169 according to the manufacturer’s protocol. Whole animal specimens of A. elegantissima
170 (aposymbiotic) and F. scutaria (larvae) were collected, frozen in liquid nitrogen and
171 stored at -80°. RNA was extracted using a combination of the TRIzol® RNA isolation
172 protocol (Life Technologies, CA, US) and RNeasy Mini Kit (Qiagen, CA, US). The
173 TRIzol® protocol was used for initial steps up to and including the chloroform
174 extraction. Following tissue homogenization, an additional centrifugation step was
175 performed at 12,000 x g for 10 min to remove tissue debris. After the chloroform
176 extraction, the aqueous layer was combined with an equal volume of 100% EtOH and the
177 RNeasy Mini Kit was used to perform washes following the manufacturer’s protocol.
178
179 A core sample of M. cavernosa was collected, frozen in liquid nitrogen and stored at
180 -80°. Total RNA from M. cavernosa was extracted following a modified TRIzol®
181 protocol with a 12 M LiCl precipitation (Mazel et al. 2003). Briefly, the coral fragment
182 was vortexed in TRIzol® reagent for 15 min and then processed according to the
183 manufacturer’s instructions through phase separation. To precipitate RNA, 0.25 ml of
184 isopropanol and 0.25 ml of a high salt solution (0.8 M sodium citrate and 1.2 M NaCl)
185 per 1 ml of TRIzol® used was added to the aqueous supernatant. The addition of high
186 salt solution removes proteoglycan and polysaccharide contaminants. The solution was
187 incubated at room temperature for 10 min and then centrifuged at 12,000 x g for 10 min
188 at 4°. After centrifugation, the standard TRIzol® protocol was followed through the
189 ethanol wash. To remove PCR inhibitors of an unknown nature that are frequently
190 encountered in coral samples, RNA was precipitated by adding an equal volume of 12 M
191 LiCl, then incubated for 30 min at -20°. The sample was centrifuged at 12,000 x g for 15
192 min at room temperature and washed with 75% EtOH (1 ml per 1 ml of TRIzol®)
193 followed by centrifugation at 7,500 x g for 5 min at room temperature. The supernatant
194 was removed and the RNA pellet was air-dried.
195
196 The extracted total RNA from each sample was DNase-treated using a TURBO DNA-
197 Free Kit (Ambion, CA, US) to remove genomic DNA contamination. RNA quantity and
198 quality were assessed using the NanoDrop ND-1000 UV-Vis Spectrophotometer (Thermo
199 Scientific, MA, US) and gel electrophoresis.
200
201 Preparation of sequencing libraries
202 Polyadenylated RNA was purified from 10 µg of total RNA using the Magnetic mRNA
203 Isolation Kit (New England Biolabs, MA, US). First strand cDNA was synthesized using
204 ProtoScript M-MuLV FS-cDNA Synthesis Kit (New England Biolabs, MA, US)
205 according to the manufacturer’s protocol and modified oligonucleotides in Table S1 in
206 File S1. Second strand synthesis was performed by incubating first-strand cDNA with 1x
207 NEBNext Second Strand Synthesis Buffer (New England Biolabs, MA, US), 0.2 mM
208 dNTPs, 15 units of E. coli DNA Ligase (New England Biolabs, MA, US), 75 units of E.
209 coli DNA polymerase I (New England Biolabs, MA, US) and 3 units of RNase H (New
210 England Biolabs, MA, US) for 2 hr at 16°. cDNA was purified using the GeneJet PCR
211 Purification Kit (Fermentas, MA, US) then fragmented using NEBNext dsDNA
212 Fragmentase (New England Biolabs, MA, US) according to the manufacturer’s protocol,
213 with the addition of 5 mM MgCl2 and 1 mg ml⁻¹ BSA (New England Biolabs, MA, US).
214 Fragmented cDNA was purified and the ends repaired using NEB Quick Blunting Kit
215 (New England Biolabs, MA, US) according to manufacturer’s protocol. The product was
216 purified and A-tailed in a reaction with nuclease-free water, 1x NEB Standard Taq Buffer
217 (New England Biolabs, MA, US), 1 mM dATP, and 2 units of NEB Standard Taq (New
218 England Biolabs, MA, US) at 68° for 2 hours. Tailed templates were ligated to double
219 stranded adaptors prepared with oligonucleotides from the Illumina Customer Sequence
220 Letter (version August 12, 2014 (Illumina 2014), Table S1 in File S1). Purified, tailed
221 cDNA was combined with T4 DNA Ligase Buffer (New England Biolabs, MA, US), T4
222 DNA Ligase (New England Biolabs, MA, US), and the double stranded adaptors and the
223 solution was incubated at 12° for at least 6 hours. Ligation products were purified then
224 amplified using custom sample-specific barcodes (“indices”) designed with a 3-bp
225 minimum Hamming distance based on Illumina barcodes (Illumina 2014) (Table S1 in
226 File S1). PCR included template cDNA, Phusion Taq polymerase buffer (Thermo
227 Scientific, MA, US), dNTPs, 5’ Illumina “i5” barcoding oligo and 3’ Illumina “i7”
228 multiplex oligonucleotide and Phusion High Fidelity Taq polymerase (Thermo Scientific,
229 MA, US). Reactions were incubated at 98° for 30 seconds, followed by 17-21 cycles of:
230 98° for 10 s, 63° for 30 s, 72° for 1.5 min. Reactions were amplified for the minimum
231 cycle number required to produce a visible product on a 1% agarose gel. PCR products
232 were size-selected by excising the 350-550 bp fraction from a 2% agarose gel. Finally,
233 size-selected sequencing libraries were extracted using the E.Z.N.A. Gel Extraction Kit
234 (Omega Bio-Tek, GA, US).
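
The 3-bp minimum Hamming distance used for barcode design can be checked with a short script. The following is a minimal Python sketch for illustration only; the function names and the example barcodes are hypothetical and are not the sequences listed in Table S1 in File S1.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_hamming(barcodes):
    """Smallest Hamming distance over all pairs in a barcode set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


# Hypothetical 6-bp barcodes, used only to exercise the check.
example_barcodes = ["AAAAAA", "AAATTT", "TTTAAA"]
passes_design_rule = min_pairwise_hamming(example_barcodes) >= 3
```

With a minimum distance of 3, a single sequencing error within a barcode cannot convert it into another barcode in the set.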
235
236 Sequencing, processing and assembly
237 cDNA libraries were sequenced on Illumina HiSeq 2000 at University of Oregon’s
238 Genomics Core Facility (Eugene, OR). All cDNA libraries were pooled on a single lane
239 to produce 100 bp paired-end reads. Raw sequences were filtered using custom Perl
240 scripts to remove uninformative (matching adaptors in Table S1 in File S1, or poly-A
241 tail) and low quality reads (> 20 positions with quality scores < 20) (Meyer et al. 2011).
242 All custom scripts used in this study are available online at GitHub
243 (https://github.com/Eli-Meyer). The high-quality filtered reads were then assembled
244 using default settings in Trinity v2.0.2, a de Bruijn graph based assembler that uses
245 paired-end data to reconstruct transcripts and group these into components intended to
246 represent the collection of transcripts originating from a single gene (Grabherr et al.
247 2011).
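
The quality criterion described above (discarding reads with more than 20 positions whose quality scores fall below 20) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the custom Perl scripts used in this study; the function names and the Phred+33 quality encoding are assumptions.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to integer scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def is_low_quality(quality_string, min_score=20, max_low_positions=20):
    """True if more than max_low_positions bases score below min_score."""
    low = sum(1 for q in phred_scores(quality_string) if q < min_score)
    return low > max_low_positions


def filter_reads(records):
    """Keep (name, sequence, quality) records that pass the quality criterion."""
    return [rec for rec in records if not is_low_quality(rec[2])]
```

Adaptor and poly-A screening, which the study's scripts also perform, is omitted from this sketch.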
248
249 Functional annotation
250 To develop these assemblies as resources for functional studies, we assigned putative
251 gene names and functional categories (Gene Ontology, GO; and Kyoto Encyclopedia of
252 Genes and Genomes; KEGG) to assembled transcripts based on sequence comparisons
253 with online databases. All sequence comparisons were conducted using BLAST+ from
254 National Center for Biotechnology Information (NCBI) (Package version 2.2.29)
255 (Altschul et al. 1990). Gene names were assigned by comparing transcript sequences
256 against UniProt protein sequence databases (SwissProt and TREMBL) using BLASTx
257 with an expect value (E value) cutoff of 10⁻⁴. Each transcript was assigned a gene name
258 based on its best match, excluding matches with uninformative names (e.g.
259 uncharacterized, unknown, or hypothetical). GO terms describing biological processes,
260 molecular functions, and cellular components were assigned to each transcript based on
261 GO-UniProt associations of its best match, downloaded from the Gene Ontology website
262 (The Gene Ontology et al. 2000). KEGG orthology terms were assigned from single-
263 directional best hit BLAST searches of each transcriptome on the KEGG Automatic
264 Annotation Server (Moriya et al. 2007).
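
The gene-naming rule described above (best match, skipping uninformative names) can be illustrated with a short Python sketch. The input layout (description, E value, bit score) and the use of bit score to rank matches are assumptions made for this example; they are not taken from the scripts used in this study.

```python
UNINFORMATIVE = ("uncharacterized", "unknown", "hypothetical")


def is_informative(name):
    """False if the description contains any uninformative keyword."""
    lowered = name.lower()
    return not any(word in lowered for word in UNINFORMATIVE)


def assign_gene_name(hits, max_evalue=1e-4):
    """Return the description of the best informative hit, or None.

    hits: list of (description, evalue, bitscore) tuples for one transcript.
    Ranking by bit score is an assumption of this sketch.
    """
    candidates = [h for h in hits if h[1] <= max_evalue and is_informative(h[0])]
    if not candidates:
        return None
    return max(candidates, key=lambda h: h[2])[0]
```

In this sketch a transcript whose only significant matches are uninformative receives no gene name rather than an uninformative one.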
265
266 Reference transcriptome databases
267 The sequence data used in this study have been archived in several public repositories.
268 Raw sequence data have been deposited in the Sequence Read Archive at NCBI
269 [Accession number: SRP063463]. The annotated assemblies have been archived at the
270 Dryad Digital Repository [Accession number: doi:10.5061/dryad.3f08f].
271
272 To enhance the utility of these resources for the coral research community, we have also
273 developed searchable databases and made these publicly available on the author’s
274 laboratory website hosted at Oregon State University (Meyer). Databases were produced
275 using the open-source SQLite software library and can be queried directly using a
276 publicly accessible web form. To demonstrate the utility of our searchable databases for
277 rapidly identifying genes of interest, we searched each database for a few genes
278 previously studied in cnidarians, including a cell adhesion molecule (sym32) (Reynolds
279 et al. 2000), a cysteine biosynthesis enzyme (cystathionine β-synthase, Cbs) (Shinzato et
280 al. 2011), and a fluorescent protein (GFP) (Mazel et al. 2003; Shinzato et al. 2012b;
281 Smith-Keune and Dove 2008). For comparison with these simple text searches, we also
282 conducted a more comprehensive search for each gene based on reciprocal BLAST.
283 Representative sequences for each gene were obtained from the UniProt database
284 (version 2014_09, downloaded October 20, 2014), and searched against each assembly
285 using tBLASTn (bit-score ≥ 45). The matching transcripts were then reciprocally
286 compared against UniProt using BLASTx. Reciprocal matches were evaluated at the
287 level of gene names: transcripts identified by searching with a target gene (e.g. B5T1L4,
288 GFP from Acropora millepora) were accepted if they reciprocally matched a different
289 gene with corresponding annotation (e.g. Q9U6Y6, a GFP gene from Anemonia
290 manjano).
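
A simple text search of the kind described above can be sketched with Python's built-in sqlite3 module. The table and column names (annotations, transcript_id, gene_name) and the example rows are hypothetical; the schema of the databases produced in this study is not described here.

```python
import sqlite3


def build_demo_db(rows):
    """Create an in-memory database with a hypothetical annotation table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE annotations (transcript_id TEXT PRIMARY KEY, gene_name TEXT)"
    )
    conn.executemany("INSERT INTO annotations VALUES (?, ?)", rows)
    return conn


def search_by_gene_name(conn, keyword):
    """Return (transcript_id, gene_name) rows whose name contains the keyword."""
    cursor = conn.execute(
        "SELECT transcript_id, gene_name FROM annotations "
        "WHERE gene_name LIKE ? ORDER BY transcript_id",
        ("%" + keyword + "%",),
    )
    return cursor.fetchall()
```

A keyword search of this kind depends entirely on the annotation text, which is why a reciprocal BLAST search gives a more comprehensive answer for the same gene.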
291
292 Evaluating gene content and completeness of assembly
293 An ideal reference transcriptome would include all genes present in the genome of an
294 organism, but low or tissue-specific expression can lead to incomplete sampling of genes
295 during cDNA library preparation. To evaluate the gene representation of our assemblies,
296 we searched each assembly for sequence similarity with a core set of conserved
297 eukaryotic genes (CEGMA; (Parra et al. 2007)) and with gene models from sequenced
298 anthozoan genomes: the coral Acropora digitifera [OIST: adi_v1.0.1] (Shinzato et al.
299 2011) and the anemone Nematostella vectensis [assembly version: Nemve1] (Putnam et
300 al. 2007). Sequence comparisons were conducted using NCBI’s BLASTx Basic Local
301 Alignment Search Tool (Altschul et al. 1990), and bit-scores ≥ 50 considered significant.
302
303 An ideal transcriptome assembly would also include complete transcripts as contiguous
304 sequences or contigs, but variation in coverage and sequence characteristics leads to
305 fragmented assemblies consisting of partial transcripts. To evaluate the effectiveness of
306 our assemblies in reconstructing complete transcripts, we calculated the Ortholog Hit
307 Ratio (OHR), a metric ranging from 0-1 that indicates the proportion of each gene
308 included in the assembled transcript (O'Neil et al. 2010). Each assembly was compared to
309 gene models from the N. vectensis genome using BLASTx to identify orthologs. We
310 calculated OHR first with a relatively stringent approach (OHR_HITS), as the proportion of
311 each N. vectensis gene included within local alignments with assembled transcripts (HSPs
312 in BLASTx output). Since this approach excludes divergent regions, we calculated OHR
313 with an alternative and more inclusive approach (OHR_ORF), as the ratio of the transcript’s
314 longest ORF (in the BLASTx-defined reading frame) relative to the length of its
315 corresponding N. vectensis protein. When multiple transcripts matched a single gene we
316 considered only the highest OHR. Distributions of maximum OHR scores and summary
317 statistics were examined to evaluate the completeness of each assembly.
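
The two OHR calculations described above can be sketched in Python as follows. The function names, the interval representation (1-based, inclusive coordinates on the reference protein), and the capping of ratios at 1 to match the stated 0-1 range are assumptions of this sketch.

```python
def covered_length(intervals):
    """Total residues covered by 1-based inclusive intervals, merging overlaps."""
    total = 0
    current_end = 0
    for start, end in sorted(intervals):
        if end <= current_end:
            continue  # interval lies entirely within a region already counted
        total += end - max(start, current_end + 1) + 1
        current_end = end
    return total


def ohr_hits(hsp_intervals, protein_length):
    """Proportion of the reference protein covered by local alignments (HSPs)."""
    return min(1.0, covered_length(hsp_intervals) / protein_length)


def ohr_orf(orf_length_aa, protein_length):
    """Ratio of the transcript's longest ORF to the reference protein length."""
    return min(1.0, orf_length_aa / protein_length)


def max_ohr_per_gene(records):
    """Keep the highest OHR for each gene from (gene_id, ohr) pairs."""
    best = {}
    for gene, ohr in records:
        if ohr > best.get(gene, -1.0):
            best[gene] = ohr
    return best
```

For example, two HSPs covering residues 1-50 and 40-100 of a 200-residue protein give a hit-based OHR of 0.5, because the overlapping residues are counted once.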
318
319 Screening for biological contamination
320 All species used in this study engage in symbiotic associations with intracellular
321 dinoflagellate symbionts and therefore RNA extracted from these specimens is expected
322 to include contributions from both animal hosts and dinoflagellate symbionts. To evaluate
323 these contributions we conducted a series of sequence comparisons aiming to identify the
324 taxonomic origin of each transcript (Figure 1). Transcripts were compared with a series
325 of sequence databases using BLAST v.2.2.29 with a bit-score threshold of 45. To identify
326 transcripts derived from rRNA, each assembly was compared with cnidarian rRNA
327 sequences using BLASTn. N. vectensis sequences were chosen for this purpose as they
328 represent the most complete cnidarian sequences in the SILVA rRNA database [SILVA:
329 ABAV01023297, ABAV01023333] (Quast et al. 2012). Transcripts were compared with
330 a cnidarian mitochondrial genome using BLASTn; for this analysis, we chose the
331 complete mitochondrial genome from Acropora tenuis [NCBI: NC_003522.1] (van
332 Oppen et al. 2002). To identify the taxonomic origin of each transcript, sequences were
333 compared with the NCBI non-redundant (nr) protein database (downloaded March 12,
334 2014) using BLASTx (E value 10⁻⁵) (Altschul et al. 1990). To avoid errors that might
335 arise from the scarcity of cnidarian and dinoflagellate sequences in these databases,
336 transcripts were compared with gene models from Symbiodinium minutum (clade B)
337 [OIST: symbB.v1.2.augustus.prot] and A. digitifera [OIST: adi_v1.0.1_prot] using
338 BLASTx. The taxonomic origin of each sequence was categorized as follows. First,
339 transcripts matching rRNA or mitochondrial sequences were assigned to those categories.
340 Transcripts matching Symbiodinium genes more closely than coral genes, that did not
341 return a metazoan hit as their best match in nr, were assigned to the dinoflagellate
342 category. Transcripts matching coral genes more closely than Symbiodinium genes, that
343 also matched metazoan records or lacked matches in nr, were categorized as metazoan.
344 Transcripts that showed conflicting results (metazoan in one db but non-metazoan in the
345 other) were categorized as “unknown”. Transcripts lacking any match to either coral or
346 Symbiodinium genes were assigned based on taxonomic annotation of the best match in
347 nr, if available. This series of decisions made it possible to classify each transcript based
348 on origin (ribosomal, mitochondrial, other metazoan genes, dinoflagellate, or “other
349 taxa”, which includes prokaryotes, “uncertain”, or “no match”).
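
The series of decisions described above can be written as a small classification function. This Python sketch is one reading of that decision series, not the pipeline used in this study; the argument names, the category labels, and the treatment of tied bit scores as conflicting are assumptions.

```python
def classify_transcript(rrna_hit, mito_hit, coral_bits, symb_bits, nr_taxon):
    """Assign a taxonomic origin to one transcript.

    rrna_hit, mito_hit: True if the transcript matched rRNA or mitochondrial sequences.
    coral_bits, symb_bits: best bit scores against coral and Symbiodinium gene
        models, or None when there was no match.
    nr_taxon: taxonomic label of the best nr match (e.g. "metazoan"), or None.
    """
    if rrna_hit:
        return "ribosomal"
    if mito_hit:
        return "mitochondrial"
    coral = coral_bits or 0.0
    symb = symb_bits or 0.0
    if coral == 0.0 and symb == 0.0:
        # No match to either gene set: fall back on the nr annotation.
        return nr_taxon if nr_taxon is not None else "no match"
    if symb > coral:
        return "dinoflagellate" if nr_taxon != "metazoan" else "unknown"
    if coral > symb:
        return "metazoan" if nr_taxon in ("metazoan", None) else "unknown"
    return "unknown"  # tied scores are treated as conflicting (assumption)
```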
350
351 Screening for cross-contamination
352 During preliminary analysis of the transcriptome assemblies, we observed a few
353 orthologs with unexpectedly high sequence similarity (> 99%) among species.
354 Since cross-contamination could realistically occur at several different stages during
355 multiplex library preparation and sequencing, we tested for evidence of cross-
356 contamination in our transcriptome assemblies and developed a pipeline to eliminate
357 contaminating sequences. To evaluate the extent of cross contamination in our libraries,
358 we mapped the cleaned reads used to produce each assembly against that assembly using
359 the Trinity utility align_and_estimate_abundance.pl (Haas et al. 2013). We then
360 compared all transcriptome libraries prepared and sequenced together using BLASTn to
361 identify nearly identical sequences present in multiple assemblies (bit-score ≥ 100). This
362 analysis identified many sequences occurring in multiple assemblies, which were highly
363 abundant in one sample (consistent with this being their true origin) but very low
364 abundance (at least 10-fold lower) in other assemblies (consistent with cross-contamination).
365 To evaluate the level of sequence similarity expected among anthozoan transcriptomes,
366 for comparison with the similarity observed among our assemblies, we compared
367 publicly available transcript assemblies produced independently in different labs
368 (Pocillopora damicornis, (Traylor-Knowles et al. 2011); A. digitifera, (Shinzato et al.
369 2011); and A. millepora, (Meyer et al. 2009)). To eliminate putative cross-contaminants
370 identified in our assemblies, we first compared assemblies using BLASTn to identify
371 highly similar sequences (bit-score ≥ 100). We then estimated the abundance of each
372 transcript in each assembly by mapping and counting reads from each library against the
373 assembly produced from those reads, using the Trinity utility
374 align_and_estimate_abundance.pl. To identify and remove sequences that might result
375 from cross-contamination, we categorized each transcript based on sequence similarity
376 and relative expression in all other assemblies. Any transcripts with nearly-identical
377 matches in more than one assembly were assigned to the assembly in which each was
378 most abundant, if the sequence was at least 10-fold more abundant in that library than any
379 others. Alternatively, transcripts found at comparable levels (< 10-fold difference) in
380 multiple assemblies were flagged as “unknown origin” and excluded from further
381 analysis.
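
The 10-fold abundance rule described above can be sketched as a single decision function. This is an illustrative Python sketch; the function name, the dictionary input, and the assembly labels in the usage note are hypothetical, and the abundance values would come from read mapping in practice.

```python
def assign_origin(abundances, fold=10.0):
    """Decide the origin of a set of nearly identical transcripts.

    abundances: dict mapping assembly name to the abundance of the matching
        transcript in that assembly.
    Returns the assembly in which the sequence is at least `fold` times more
    abundant than in any other, or "unknown origin" otherwise.
    """
    ranked = sorted(abundances.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) == 1:
        return ranked[0][0]
    top_name, top_value = ranked[0]
    second_value = ranked[1][1]
    if top_value > 0 and top_value >= fold * second_value:
        return top_name
    return "unknown origin"
```

For example, abundances of 1000 in one assembly and 5 in another assign the sequence to the first assembly, whereas abundances of 100 and 50 leave its origin unresolved.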
382
383 Development of SSR markers
384 Simple sequence repeats (SSRs), also known as microsatellites, are tandem repeats of
385 short DNA motifs, 2-5 base pairs in length. These molecular markers have been widely used for
386 studies of genome mapping, genetic linkage and population structure. Although SSRs
387 have largely been replaced with sequencing-based approaches for single nucleotide
388 polymorphism (SNP) genotyping, in some situations they may still be the most practical
389 option. To demonstrate the utility of transcriptome assemblies for SSR marker
390 development and identify SSR markers for the four species described here, we used a
391 pipeline we have previously described for identifying SSRs in coral sequence data
392 (Davies et al. 2013). In brief, sequences containing repetitive regions (≥ 30 bp, ≤ 15%
393 deviation from perfect repeat structure, ≥ 30 bp flanking regions) were identified using
394 RepeatMasker (Smit et al. 1996-2010), and then assembled using CAP3 to eliminate
395 redundancy (Huang and Madan 1999). Target sequences were further screened for
396 redundancy using BLASTn (Altschul et al. 1990) to identify unique targets within each
397 repeat type (e.g., AT, CCG, etc.). Finally, primer sequences flanking these SSRs were
398 developed using Primer3 (Rozen and Skaletsky 1999), targeting regions 150-500 bp with
399 45-65% GC content.
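
The length and flanking-region criteria described above can be illustrated with a simplified repeat finder. This Python sketch uses a regular expression in place of RepeatMasker and detects only perfect repeats, so it does not reproduce the 15% deviation allowed in the pipeline; the function name and thresholds as parameters are assumptions.

```python
import re

# A 2-5 bp motif followed by one or more exact copies of itself.
SSR_PATTERN = re.compile(r"([ACGT]{2,5}?)\1+")


def find_perfect_ssrs(sequence, min_repeat_len=30, min_flank=30):
    """Return (motif, start, end) for perfect repeats with adequate flanks.

    Coordinates are 0-based and end-exclusive.
    """
    seq = sequence.upper()
    hits = []
    for match in SSR_PATTERN.finditer(seq):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # skip homopolymer runs such as AAAA
        if match.end() - match.start() < min_repeat_len:
            continue  # repeat region too short
        if match.start() < min_flank or len(seq) - match.end() < min_flank:
            continue  # not enough flanking sequence for primer design
        hits.append((motif, match.start(), match.end()))
    return hits
```

Because the motif is matched lazily, a repeat may be reported in a shifted phase in some sequences; a dedicated tool is preferable for production use.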
400
401 Identification of orthologous groups
402 To facilitate comparative studies of cnidarian gene sequences, and demonstrate the utility
403 of our transcriptome assemblies for phylogenetic analysis, we identified orthologous
404 groups among the four transcriptomes generated in this study. We also compared these
405 with sequence resources from other cnidarians and basal metazoans, including a marine
406 sponge Amphimedon queenslandica (Srivastava et al. 2010), the hydrozoan Hydra
407 magnipapillata (Chapman et al. 2010), the scyphozoan Aurelia aurita (Fuchs et al.
408 2014), and a variety of other anthozoans including Aiptasia pallida (Lehnert et al. 2012),
409 N. vectensis (Putnam et al. 2007), A. digitifera (Shinzato et al. 2011), Porites astreoides
410 (Kenkel et al. 2013), P. damicornis (Vidal-Dupiol et al. 2013), Stylophora pistillata
411 (Karako-Lampert et al. 2014), Orbicella faveolata (formerly belonging to the genus
412 Montastraea (Budd and Stolarski 2011; DeSalvo et al. 2008)), and Pseudodiploria
413 strigosa (Table S2 in File S1). These resources varied in the types of sequencing
414 technologies used to create them and this resulted in differing degrees of assembly
415 completeness, ranging from whole genomes to EST libraries (Table S2 in File S1). All
416 resources were converted into candidate protein coding sequences using the package
417 TransDecoder (transdecoder.sourceforge.net) that identifies open reading frames. Protein
418 sequences were then processed with FastOrtho (enews.patricbrc.org/fastortho), an
419 OrthoMCL based program (Li et al. 2003) that performs an all-by-all BLAST of the input
420 sequences (E value cutoff 10⁻⁵) and clusters orthologous groups with the Markov
421 Cluster algorithm (Van Dongen 2000).
422
423 Phylogenetic analysis
424 The four transcriptomes from this study and other sequence resources were used to infer
425 phylogenetic relationships from commonly used markers and newly identified orthologs.
426 The mitochondrial gene cytochrome c oxidase 1 (COI) has been used to reconstruct the
427 most comprehensive phylogeny of corals (Anthozoa, Scleractinia) (Kitahara et al. 2010)
428 and mitochondrial sequences are commonly used to infer evolutionary relationships of
429 the Cnidaria (Kitahara et al. 2010; Bridge et al. 1992; Kayal et al. 2013). Recent findings
430 suggest, however, that a concatenated set of NADH dehydrogenase genes (ND 2, 4 and 5)
431 outperforms COI in metazoan datasets including in anthozoans (Havird and Santos 2014).
432
433 To investigate the effect of increased gene sampling on phylogenetic inferences, we
434 compared phylogenetic trees constructed based on (a) the widely-used marker COI, (b)
435 the ND supergene, and (c) the set of orthologs identified from a comparison of our
436 transcriptomes with other cnidarian sequence resources. All taxa used in searches for
437 orthologous groups were included and A. queenslandica served as the outgroup. The
438 TransDecoder catalog of proteins for each organism was made into a local BLAST
439 database. Then, the mitochondrial protein sequences of COI, ND2, ND4 and ND5 were
440 found from BLASTx searches against our local databases, UniProt or NCBI databases
441 (Tables S3 and S4 in File S1). In some cases, mitochondrial genes were not recovered
442 from the local protein databases, but were found by tBLASTx searches against the original resources.
443 These transcripts were instead translated using Expasy Translate Tool
444 (http://web.expasy.org/translate/) under the “invertebrate mitochondrial” genetic code.
445 Protein sequences for COI, ND2, ND4 and ND5 were aligned using MAFFT v6.864b
446 (Katoh et al. 2002). In some cases, the mitochondrial sequences were fragmented within a
447 single database or recovered from two separate databases (Tables S3 and S4 in File S1).
448 These fragments were aligned and manually combined to increase total alignment
449 positions. Individual MAFFT alignments of ND2, ND4 and ND5 were concatenated into
450 a single matrix in Mesquite (v. 3.02) (Maddison and Maddison 2011). Protein alignments
451 of COI and the ND genes were run through ProtTest server
452 (http://darwin.uvigo.es/software/prottest_server.html) (Abascal et al. 2005) to select the
453 appropriate substitution rate model based on AIC and BIC criteria. Phylogenetic trees
454 were constructed using maximum likelihood (ML) in RAxML v. 8.0.26 (Stamatakis
455 2014) under the MTZOA+G+F model (Rota-Stabelli et al. 2009). Optimal topology was
456 selected based on ML scores from 500 replicate trees. Nodal support was assessed from
457 500 bootstrap replicates.
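
The concatenation of the ND2, ND4 and ND5 alignments into a single matrix (performed in Mesquite in this study) can be pictured with the minimal Python sketch below. The sequences and taxon names are toy placeholders, and padding absent taxa with gap characters is an assumption made for illustration.

```python
def concatenate_alignments(alignments):
    """Join per-gene alignments (dicts of taxon -> aligned sequence) end to end.

    Taxa absent from a gene are padded with '-' so every row has equal length.
    """
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    supermatrix = {taxon: "" for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within an alignment must share one length")
        width = lengths.pop()
        for taxon in taxa:
            supermatrix[taxon] += aln.get(taxon, "-" * width)
    return supermatrix


# Toy alignments standing in for ND2, ND4 and ND5 (not real data)
nd2 = {"taxon_A": "MKL-", "taxon_B": "MKLV"}
nd4 = {"taxon_A": "FFW", "taxon_B": "FYW", "taxon_C": "F-W"}
nd5 = {"taxon_B": "LLP", "taxon_C": "LIP"}
matrix = concatenate_alignments([nd2, nd4, nd5])
```

Each row of the resulting matrix has the same length (here 10 positions), which is what maximum likelihood programs require of a concatenated input.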
458
459 For phylogenomic reconstruction, the computational pipeline PhyloTreePruner (Kocot et
460 al. 2013) was applied to orthologous groups with a minimum amino acid length of 100
461 from the 15 taxa identified in Table S2 in File S1. PhyloTreePruner is a phylogenetic
462 approach used to refine orthologous groups identified in programs like OrthoMCL by
463 removing predicted paralogs resulting from gene duplication or splice variants through
464 single gene-tree evaluation (Kocot et al. 2013). First, each group of orthologs was aligned
465 using MAFFT v. 6.864b with 1000 iterations. Ambiguous or uninformative positions
466 were removed from the alignment using Gblocks v. 0.91b (Castresana 2000). Then,
467 single-gene ML trees for each group inferred with FastTree2 (Price et al. 2010) were
468 screened for paralogy with PhyloTreePruner and the longest sequence for each taxon was
469 retained. The pruned orthologous groups were then merged into a single matrix using
470 FASconCAT v. 1.0 (Kück and Meusemann 2010). To examine the impact of missing data
471 on tree topology, two trees were constructed. In the conservative tree, 14-15 taxa were
472 sampled per ortholog for a total of 397 groups (73,833 unique alignment positions). The
473 relaxed tree allowed more missing data, requiring at least 10 taxa sampled per
474 ortholog for a total of 2,896 groups (535,413 unique alignment positions). For each
475 dataset, ML trees were inferred with RAxML v. 8.0.26 using the WAG+GAMMA+F
476 substitution model (Whelan and Goldman 2001). Topology for each tree was selected
477 from 100 replicate trees, and nodal support values are based on 100 and 500 bootstrap
478 replicates in the conservative and relaxed trees respectively.
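
The difference between the conservative and relaxed datasets comes down to a taxon-occupancy threshold per orthologous group. The Python sketch below illustrates that filter under stated assumptions: the data structure, the function name, and the choice to ignore sequences shorter than 100 amino acids when counting taxa are hypothetical and are not taken from PhyloTreePruner.

```python
def filter_by_occupancy(groups, min_taxa, min_length=100):
    """Return IDs of groups sampled in at least `min_taxa` taxa.

    `groups` maps a group ID to {taxon: protein sequence}. Sequences shorter
    than `min_length` amino acids are not counted (an assumption of this sketch).
    """
    kept = []
    for group_id, members in groups.items():
        n_taxa = sum(1 for seq in members.values() if len(seq) >= min_length)
        if n_taxa >= min_taxa:
            kept.append(group_id)
    return kept


# Toy orthologous groups with placeholder sequences of 120 residues
groups = {
    "OG1": {f"taxon_{i}": "M" * 120 for i in range(15)},
    "OG2": {f"taxon_{i}": "M" * 120 for i in range(11)},
    "OG3": {f"taxon_{i}": "M" * 120 for i in range(6)},
}
conservative = filter_by_occupancy(groups, min_taxa=14)
relaxed = filter_by_occupancy(groups, min_taxa=10)
```

Lowering the threshold from 14 to 10 taxa admits groups with more missing data, which is why the relaxed dataset is larger than the conservative one.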
479
480 RESULTS AND DISCUSSION
481 Sequencing and de novo assembly
482 The four libraries described here were sequenced on Illumina HiSeq 2000 (each
483 occupying 1/6th of a lane), yielding on average 26.3 million paired reads per library
484 (range: 21.2-30.3, Table S5 in File S1). A fraction of these (22% on average; range 14-
485 28%) were removed during quality and adaptor filtering prior to assembly. Assembly of
486 the remaining high-quality reads produced on average ~170,000 transcripts. This is
487 substantially higher than the number of genes in sequenced cnidarian genomes (23,677 in
488 A. digitifera, 27,273 in N. vectensis), which likely results from redundancy,
489 fragmentation in the assemblies and biological contamination. Assemblies included many
490 small contigs (on average, 47% were < 400 bp) that were unlikely to provide significant
491 matches, so for analyses based on sequence homology we considered only contigs ≥ 400
492 bp (average n=91,792). For these core transcriptome datasets used for downstream
493 analyses, the average length ranged from 1.1-1.7 kb and N50 ranged from 1.4-2.7 kb.
494 These are slightly shorter than the expected size distribution for a complete cnidarian
495 transcriptome (e.g. average ~ 1,700 and N50 ~ 2,200 bp transcripts in the A. digitifera
496 genome), suggesting incomplete assemblies. Assembly statistics of the four transcriptome
497 references developed in this study are broadly comparable to previously published
498 anthozoan transcriptomes (Moya et al. 2012; Shinzato et al. 2014; Shinzato et al. 2011;
499 Abascal et al. 2005; Traylor-Knowles et al. 2011; Lehnert et al. 2012).
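
For readers unfamiliar with the summary statistics used above, the following Python sketch shows one way to apply the 400 bp length filter and compute mean length and N50. The contig lengths are invented for illustration, and the N50 definition used (the contig length at which the cumulative sum of lengths, sorted longest first, reaches half of the total) is the conventional one rather than a detail confirmed by this study.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs supplied")


MIN_LENGTH = 400  # threshold used in the text for the core datasets

# Invented contig lengths (bp), not values from the assemblies
contig_lengths = [250, 300, 450, 800, 1200, 2500, 3100]
core = [n for n in contig_lengths if n >= MIN_LENGTH]

mean_length = sum(core) / len(core)
core_n50 = n50(core)
```

Because N50 weights contigs by their length, it is typically larger than the mean, as in the ranges reported above.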
500
501 Completeness of transcriptomes
502 To evaluate the completeness of the transcriptome assemblies from the perspective of
503 gene content, we conducted sequence comparisons with conserved eukaryotic genes and
504 gene models from sequenced relatives. The core eukaryotic genes (CEGMA; (Parra et al.
505 2007)) are expected to be expressed in most eukaryotes (Nakasugi et al. 2013; Sanders et
506 al. 2014) and are widely used to estimate transcriptome completeness. Sequence
507 comparisons revealed matches for 453 of these conserved genes (98.9%) in A.
508 elegantissima and 456 (99.5%) in F. scutaria, M. cavernosa and S. hystrix (Figure 2a).
509 For a more comprehensive view of gene representation, the transcriptomes were
510 compared with gene models from sequenced relatives (the coral A. digitifera and the
511 anemone N. vectensis). This analysis identified matches for more than 14,000 gene
512 models in each genome (BLASTx, bit-score ≥ 50): 54-67% of gene models in A.
513 digitifera (Figure 2b) and 48-49% in N. vectensis. This is comparable to the level of
514 sequence similarity observed among anthozoans with completed genomes. BLASTp
515 comparisons of predicted proteins from the A. digitifera and N. vectensis genomes using
516 the same thresholds recover 35% and 42% of genes in the other genome. This is
517 substantially lower than the optimistic estimates of representation based on CEGMA,
518 perhaps reflecting essential functions and constitutive expression of these highly
519 conserved genes. Comparisons with gene models of closely related taxa appear to provide
520 a more conservative estimate of gene representation in transcriptome assemblies.
521
522 To evaluate the effectiveness of our assemblies in reconstructing complete transcripts, we
523 calculated ortholog hit ratios (OHR) for each final assembly. This method estimates the
524 proportion of the best ortholog from a reference genome that is covered by a de novo transcript
525 (O'Neil et al. 2010), ranging from 1 (for complete transcripts) to 0 (for transcript
526 fragments). We calculated OHR based on sequence comparisons with N. vectensis gene
527 models, using two approaches. First, a relatively stringent analysis based on the
528 proportion of each N. vectensis gene included in regions of local similarity (OHR_HITS)
529 produced median OHR of 63.8, 64.7, 65.7, and 58.0% for A. elegantissima, F. scutaria,
530 M. cavernosa and S. hystrix, respectively (Figure 2c). A more inclusive analysis based on
531 the longest ORF (in BLAST defined frame) produced similar estimates (median OHR_ORF:
532 67.4, 75.8, 77.2, and 60.3% respectively). Each assembly included more than 5,000
533 reasonably complete transcripts spanning at least 75% of the corresponding N. vectensis
534 gene (range: 5,262-6,725). Overall, these comparisons with existing cnidarian sequence
535 resources quantify the representation and completeness of our assemblies, and provide a
536 framework for comparison with other de novo assemblies. These estimates compare
537 favorably with previous transcriptome completeness estimates for cnidarians (Sanders et
538 al. 2014) and several invertebrates (O'Neil and Emrich 2013; Riesgo et al. 2012) using
539 similar methods.
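
The hit-based ortholog hit ratio described above can be sketched as the fraction of the reference gene covered by regions of local similarity. The Python example below is a minimal illustration under assumptions: coordinates are taken to be 1-based and inclusive on the reference protein with start ≤ end, overlapping hits are merged before counting, and the function name and example coordinates are hypothetical rather than the study's implementation.

```python
def ortholog_hit_ratio(hit_intervals, ortholog_length):
    """Fraction of the reference ortholog covered by local alignments.

    `hit_intervals` are 1-based inclusive (start, end) coordinates on the
    ortholog; overlapping or adjacent intervals are merged before counting.
    """
    covered = 0
    current_start = current_end = None
    for start, end in sorted(hit_intervals):
        if current_end is None or start > current_end + 1:
            # Close the previous merged interval and open a new one
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1
    return covered / ortholog_length


# Hypothetical hits of one transcript against a 400-residue reference protein
ohr = ortholog_hit_ratio([(1, 100), (80, 150), (301, 400)], ortholog_length=400)
```

A transcript whose alignments span the whole reference protein returns 1.0, and a fragment covering only part of it returns a proportionally smaller value.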
540
541 Annotation of transcriptomes
542 Transcripts were annotated using BLAST homology searches against the UniProt
543 databases. Approximately a third of all transcripts matched records in UniProt (range: 30-
544 40%) (Table S5 in File S1). The relatively low fraction of sequences annotated is
545 attributable in part to sequence lengths: on average, 21% of transcripts <400 bp in length
546 were annotated, as compared with 42% of transcripts 400-1000 bp in length and 78% of
547 transcripts > 1,000 bp. Even among the longest transcripts (> 1 kb), a substantial number
548 of sequences lacked annotated matches in UniProt (range: 6,647-12,090 sequences per
549 assembly). This highlights the well-known bias in taxonomic composition of existing
550 databases, and the value of ongoing gene sequencing in under-represented metazoan taxa
551 for public sequence databases.
552
553 To categorize the biological functions inferred from sequence similarity, Gene Ontology
554 (GO) terms were assigned to transcripts matching GO-annotated records in the UniProt
555 database. This process identified functional annotation for 77% of transcripts with
556 BLAST matches, providing tentative gene identities for a large number of sequences in
557 each assembly (range: 32,299-47,547 transcripts; Table S5 in File S1). Figure 3 shows
558 the distribution of functional categories across the four transcriptomes, visualized using
559 the Web Gene Ontology Annotation Plotting (WEGO) application. The GO terms were
560 broadly distributed across the three domains and the percentages of sequences mapped to
561 a given sub-ontology were highly similar for all species, and comparable to other
562 invertebrate transcriptomes (Riesgo et al. 2012; O'Neil et al. 2010; Lehnert et al. 2012;
563 Moya et al. 2012; Polato et al. 2011; Shinzato et al. 2014; Stefanik et al. 2014; Traylor-
564 Knowles et al. 2011). The similarities in functional distributions of assemblies prepared
565 from diverse species, developmental stages, and symbiotic states highlight the
566 constitutive expression of a broad set of genes in cnidarian transcriptomes. These core
567 genes should facilitate comparative transcriptome studies by increasing the overlap
568 among incomplete libraries.
569
570 To determine taxonomic origin for each transcript, we conducted a series of BLAST
571 searches and filtering steps outlined in Figure 1. Since our assemblies were produced
572 from symbiotic and aposymbiotic specimens, the transcriptomes contain genes not only
573 from anthozoans but also from their associated microbial community. To investigate the
574 relative contributions of these sources we classified each transcript based on sequence
575 similarity (Figure 1). These analyses confirmed that metazoan sequences comprised the
576 majority of each library as expected. Fortunately, only a small fraction of transcripts were
577 derived from organelles (mitochondria and ribosomes): on average, 212 transcripts
578 (range: 127-284) in each assembly matched rRNA (N. vectensis) and 30 transcripts
579 (range: 16-54) matched the mitochondrial genome (A. tenuis). Notably, almost half of
580 transcripts in each assembly (range: 46.2% to 49.9%) lacked matches to coral or
581 Symbiodinium spp. genes, or NCBI’s nr database (Figure 4), a range that is consistent
582 with results from other anthozoan transcriptomes (Sun et al. 2013; Karako-Lampert et al.
583 2014; Polato et al. 2011; Traylor-Knowles et al. 2011). These ‘unknown’ transcripts may
584 represent lineage-specific genes (‘taxonomically-restricted genes’) that require further
585 characterization. Comparison with NCBI’s nr database revealed that the majority of
586 sequences with matches in one or more databases (59-95%) matched a metazoan
587 sequence better than any other taxon, suggesting they originated from the animal host
588 rather than from dinoflagellate or prokaryotic symbionts. A negligible fraction of
589 transcripts in each assembly (0.8-1.7%) were assigned to the “Other taxa” category, most
590 of which matched either coral or Symbiodinium genes but were classified as “unknown”
591 because of conflicting results in the nr search (e.g. transcripts that matched Symbiodinium
592 more closely than coral, but whose best matches in nr were from metazoans).
593
594 The contribution of algal symbionts varied widely across samples. In nominally
595 aposymbiotic samples of F. scutaria and A. elegantissima, 2.6% of transcripts on average
596 were classified as dinoflagellate in origin (Figure 4), which may have resulted either from
597 unexpected presence of symbionts at low abundance in these samples, or genes lacking
598 orthologs in the A. digitifera reference. The symbiotic samples from S. hystrix, in
599 contrast, showed comparable abundance of transcripts classified as metazoan (61,369)
600 and dinoflagellate in origin (41,724). Surprisingly, the M. cavernosa library that was
601 similarly prepared from a symbiotic sample showed only 7,278 transcripts from
602 dinoflagellates (Figure 4). This striking contrast in Symbiodinium contributions from
603 symbiotic specimens may have arisen from differing methods of RNA extraction. For S.
604 hystrix, tissue was airbrushed off the coral skeleton directly into RNAlater® Stabilization
605 Solution (Qiagen, CA, US) followed by complete tissue homogenization. In contrast, the
606 M. cavernosa fragment was simply vortexed to disrupt tissue, without physical
607 homogenization. Our findings suggest that omitting physical homogenization during lysis
608 can minimize symbiont contamination for studies aiming to focus on the cnidarian host,
609 while studies investigating both components may benefit from thorough homogenization
610 during extraction. The gene names, functional categories, and putative origin of each
611 transcript are annotated in Tables S6-9 in Files S2-5.
612
613 Gene searches of the database
614 The resulting annotations and sequences are available in a set of searchable databases
615 hosted by Oregon State University (Meyer). To illustrate the utility of databases for
616 cnidarian researchers targeting specific genes, we compared the effectiveness of simple
617 text searches of the databases with reciprocal BLAST (RB) analysis, a more
618 comprehensive approach that requires additional work by the end-user. Text searches
619 targeting a handful of selected genes (cell adhesion molecule sym32, green fluorescent
620 protein GFP, and cystathionine β-synthase Cbs) produced results comparable to RB
621 searches (Table S10 in File S1). Text searches are obviously sensitive to query phrasing;
622 the query “fluorescent” retrieved 51 putative GFP homologs, and functionally related
623 synonyms (“GFP”, “chromoprotein”) retrieved an additional 10. Interestingly, the Cbs
624 homologs identified in nominally aposymbiotic samples (A. elegantissima and F.
625 scutaria) showed greater sequence similarity with Symbiodinium gene models than coral
626 (A. digitifera) and were classified as dinoflagellate in our assignment procedure (Figure
627 1), while Cbs homologs in symbiotic samples (M. cavernosa and S. hystrix) included both
628 metazoan and dinoflagellate transcripts. This unexpected observation of apparently
629 dinoflagellate homologs of Cbs in nominally aposymbiotic samples is noteworthy
630 because of their variable distribution among corals and possible roles in coral nutritional
631 dependency on symbiosis (Shinzato et al. 2011). While this finding could be explained by
632 undetected Symbiodinium harbored in these putatively aposymbiotic samples, the
633 uncertainty introduced by these observations suggests that studies investigating the
634 diversity of Cbs homologs across corals may require additional data (e.g. in-situ
635 hybridization) to confirm transcript origins. Overall, the close agreement between
636 rigorous computational searches and simple text searches in these examples illustrates the
637 utility of our searchable online databases for rapidly identifying genes of interest in
638 reference transcriptome assemblies.
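
For comparison with the text searches, one common form of reciprocal BLAST analysis is the reciprocal best hit test sketched below in Python. This is an illustrative assumption about the general approach; the exact RB procedure used in this study may differ, and the function names, identifiers and bit-scores are hypothetical.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject from (query, subject, bitscore) rows."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward, reverse):
    """Pairs (query, subject) that are each other's best hit in both directions."""
    fwd, rev = best_hits(forward), best_hits(reverse)
    return sorted((q, s) for q, s in fwd.items() if rev.get(s) == q)


# Hypothetical searches: known proteins vs. assembly, then assembly vs. known proteins
forward = [
    ("GFP_query", "contig_12", 310.0),
    ("GFP_query", "contig_40", 95.0),
    ("Cbs_query", "contig_77", 220.0),
]
reverse = [
    ("contig_12", "GFP_query", 305.0),
    ("contig_77", "other_protein", 250.0),  # better match to a different protein
    ("contig_77", "Cbs_query", 210.0),
]
rbh = reciprocal_best_hits(forward, reverse)
```

Only pairs confirmed in both directions are retained, which is the extra work an end-user would take on compared with a simple text search of the annotations.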
639
640 Novel SSR markers
641 Simple sequence repeats (SSRs, or microsatellites) have been widely used to study
642 genetic diversity, hybridization events, population structure and connectivity in
643 anthozoans (Concepcion et al. 2010; Fernandez-Silva et al. 2013; Selkoe and Toonen
644 2006; Ruiz-Ramos and Baums 2014), and can directly influence phenotypic traits by
645 altering DNA replication, translation and gene expression (Ruiz-Ramos and Baums
646 2014). SSR markers can be readily identified from de novo assemblies of NGS data, and
647 emerge as a side benefit in transcriptome assembly projects conducted for other purposes.
648 We identified and developed primers for 52, 49, 73 and 75 candidate SSR markers in A.
649 elegantissima, F. scutaria, M. cavernosa and S. hystrix, respectively. Primer pairs for each
650 species are listed in Table S11 in File S6. For three of the species studied here, varying
651 numbers of SSR markers are already available. Previous studies of S. hystrix have
652 developed 10 SSR markers (Maier et al. 2001; Underwood et al. 2006) to study habitat
653 partitioning within a single reef (Bongaerts et al. 2010), dispersal and recruitment
654 patterns across multiple reefs (van Oppen et al. 2008; Kininmonth et al. 2010), and
655 population changes associated with bleaching events (Underwood et al. 2007). Candidate
656 SSR markers have been identified in F. scutaria (n=118) from the coral host and
657 dinoflagellate symbionts (Concepcion et al. 2010). SSR markers previously developed in
658 M. cavernosa (Shearer et al. 2005; Serrano et al. 2014) have been used to investigate the
659 population connectivity across depth and geographic distance (Serrano et al. 2014). The
660 candidate SSR markers identified in this study provide additional markers for future
661 studies along similar lines. To our knowledge, SSR markers have not been previously
662 developed in A. elegantissima. Although the population structure of the host has not been
663 described, analysis of their dinoflagellate symbionts revealed highly structured
664 populations across their geographic range (Sanders and Palumbi 2011). The markers
665 developed in this study for A. elegantissima provide tools to investigate population
666 structure of the host across a similar range.
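
To show how candidate SSRs can be located in assembled transcripts, the Python sketch below scans a sequence for perfect tandem repeats with a regular expression. The motif sizes (2-6 bp), the minimum of five repeats, and the example sequence are assumptions chosen for illustration and are not the criteria used in this study.

```python
import re


def find_ssrs(sequence, min_motif=2, max_motif=6, min_repeats=5):
    """Return (start, motif, repeat_count) for perfect tandem repeats.

    `start` is a 0-based offset. Runs of a single base are skipped.
    """
    pattern = re.compile(
        r"(([ACGT]{%d,%d}?)\2{%d,})" % (min_motif, max_motif, min_repeats - 1)
    )
    results = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(2)
        if len(set(motif)) == 1:
            continue  # homopolymer run, not reported here
        results.append((match.start(), motif, len(match.group(1)) // len(motif)))
    return results


# Invented transcript fragment containing two repeats
fragment = "TTG" + "AC" * 6 + "GGC" + "GAT" * 5 + "CC"
ssrs = find_ssrs(fragment)
```

Primers would then be designed in the flanking sequence on either side of each reported repeat, a step not covered by this sketch.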
667
668 Orthologous groups and phylogenomic reconstructions
669 With the increasing availability of transcriptomes and genomes, these datasets can now
670 be mined to discover novel phylogenetic markers within Anthozoa and across the
671 Cnidaria to resolve taxonomic uncertainties. Phylogenetic reconstruction of anthozoans
672 has presented challenges because analyses based on morphology, life history, and
673 molecular sequences have failed to adequately delineate taxonomic boundaries or
674 evolutionary relationships (Daly et al. 2003). To date, molecular phylogenies for
675 anthozoans have been based on one or a small number of markers including nuclear
676 ribosomal 28S and 18S genes (Daly et al. 2003; Berntson et al. 1999), β-tubulin (Fukami
677 et al. 2008), mitochondrial 16S (Daly et al. 2003), cytochrome b (Fukami et al. 2008),
678 and COI (Kitahara et al. 2010; Fukami et al. 2008). Interestingly, mitochondrial
679 sequences in anthozoans have extremely low mutation rates compared to the bilaterians
680 and are therefore highly conserved, allowing for robust comparisons across distantly
681 related taxa (van Oppen et al. 2002; Galtier et al. 2009). Therefore, the mitochondrial
682 gene COI has been used recently to define evolutionary relationships among scleractinian
683 corals (Kitahara et al. 2010; Fukami et al. 2008; Budd and Stolarski 2011), and to support
684 the distinction of robust corals from the complex corals (Romano and Palumbi 1996).
685
686 One disadvantage to single gene phylogenetic inferences is that they suffer from weak
687 phylogenetic signals, sensitivity to hidden paralogy, and spurious tree artifacts (Philippe
688 et al. 2004). Despite these potential limitations, single gene trees have advanced the field
689 of cnidarian systematics. However, polyphyly remains a problem amongst several
690 anthozoan families when using both maximum likelihood and Bayesian analyses (Fukami
691 et al. 2008; Budd and Stolarski 2011), which has led to recent shifts in taxonomic
692 classification (Budd and Stolarski 2011). To expand beyond previous single-gene
693 approaches, we performed phylogenomic analyses incorporating the four new
694 transcriptomes and other available ‘omic’ resources. By simultaneously increasing taxon
695 and gene sampling, phylogenetic inference is expected to improve (Philippe et al. 2004)
696 and may help resolve some of the challenges in reconstructing the evolutionary
697 relationships of the Anthozoa and more broadly, phylum Cnidaria.
698
699 For phylogenomic analysis, transcripts larger than 400 bp were converted to protein with
700 TransDecoder and clustered into orthologous groups using FastOrtho. The number of
701 assigned orthologous groups ranged from 14,144 to 21,147 for the four transcriptomes
702 (Figure S1 in File S1). Comparison of all four resulted in 6,560 shared orthologs (Figure
703 S1 in File S1). The three coral species shared 2,045 orthologs not found in anemones and
704 the two most closely related corals (M. cavernosa and F. scutaria) shared 1,682 orthologs
705 absent from the other assemblies. By incorporating 11 additional taxa for phylogenomic
706 analysis (Table S2 in File S1), 443 orthologs were identified between all taxa. After
707 setting a minimum protein length (100 amino acids), these orthologs were refined using
708 the PhyloTreePruner analysis pipeline (Kocot et al. 2013). Filtering resulted in the
709 identification of 397 orthologs for ≥ 14 taxa. These were used to construct a
710 phylogenetic tree we termed “conservative” because loci with any missing data were
711 excluded (Table S12 in File S7).
712
713 Missing data are a commonly encountered problem in phylogenomic analyses, arising
714 either from reduced transcript length or from gene absence from a transcriptome (Philippe et
715 al. 2004; Kocot et al. 2013; Roure et al. 2013). However, the sensitivity of phylogenetic
716 inference to incomplete datasets is still under investigation, with mixed results from
717 phylogenomic analyses on large, but patchy supermatrices (Roure et al. 2013; Philippe et
718 al. 2004). Since the resources in this study used for ortholog identification differed in
719 completeness, ranging from EST libraries to complete genomes (Table S2 in File S1), we
720 tested the influence of missing data on our phylogenetic reconstruction. To investigate
721 this, we lowered the required number of taxa per orthologous group to ≥ 10, which
722 identified 2,897 orthologs (Table S12 in File S7). This second set was used to create the
723 “relaxed” phylogeny, so called because loci with some missing data were included.
724
725 Both maximum likelihood phylogenomic analyses reconstructed identical and strongly
726 supported topologies (bootstrap = 100; Figure S2 in File S1), demonstrating that our
727 phylogenetic inference was insensitive to missing data (Figure 5). However, the
728 relationship of the corals in the family Faviidae, containing M. cavernosa, P. strigosa and
729 O. faveolata varied among the COI, ND supergene and phylogenomic analyses. The
730 mitochondrial ND supergene identified by Havird and Santos (2014)
731 produced a phylogenetic tree nearly congruent with the accepted cnidarian taxonomic
732 relationships and phylogenomic analyses from this study (Kitahara et al. 2010), except
733 for the placement of M. cavernosa as sister taxon to O. faveolata and P. strigosa. The
734 analysis of the single gene COI resulted in a discordant phylogenetic topology (Figure 5),
735 failing to reconstruct the complex coral clade (P. astreoides and A. digitifera), which was
736 recovered by the ND supergene, relaxed and conservative trees (Figure 5, Figure S2 in File S1).
737 In the COI tree, the placement of F. scutaria, from the family Fungiidae, as sister
738 taxon to P. strigosa and M. cavernosa from the family Faviidae, instead of O. faveolata, is
739 incongruent with current taxonomic placement (Figure 5) (Kitahara et al. 2010; Budd and
740 Stolarski 2011). Furthermore, while the phylogenomic analyses placed O. faveolata as
741 sister to P. strigosa with strong support (bootstrap=100), this relationship was not
742 recovered in either mitochondrial phylogeny (Figure S2 in File S1). Overall, the tree
743 topology from the phylogenomic analyses is consistent with accepted evolutionary
744 relationships within Anthozoa (Budd and Stolarski 2011; Fukami et al. 2008; Kitahara et
745 al. 2010).
746
747 CONCLUSION
748 The annotated transcriptome assemblies developed in this study provide useful resources
749 for genomic research in anthozoan species for which sequence resources were
750 previously lacking. The searchable databases developed from these assemblies make it
751 possible to rapidly identify genes of interest from each species. Our ortholog analysis
752 demonstrates the feasibility of phylogenetic inference in corals using transcriptome
753 assemblies from diverse stages and symbiotic states, highlighting a promising path
754 toward resolving major uncertainties in the existing phylogeny of scleractinians. Future
755 studies will benefit from the growing body of anthozoan sequence resources, including
756 the four assemblies contributed in this study.
757
758 AVAILABILITY OF SUPPORTING DATA
759 The data sets supporting the results of this article are available from the Sequence Read
760 Archive at NCBI [Accession number: SRP063463], the Dryad Digital Repository
761 [doi:10.5061/dryad.3f08f], and the author’s website
762 [http://people.oregonstate.edu/~meyere/index.html].
763
764 ABBREVIATIONS
765 BLAST: Basic Local Alignment Search Tool; Cbs: cystathionine β-synthase; CEGMA:
766 Core Eukaryotic Genes Mapping Approach; COI: cytochrome oxidase subunit 1; EST:
767 expressed sequence tag; E value: expect value; GO: gene ontology; GFP: green
768 fluorescent protein; mtDNA: mitochondrial DNA; ML: maximum likelihood; NCBI:
769 National Center for Biotechnology Information; ND: NADH dehydrogenase; OHR:
770 ortholog hit ratio; NGS: next-generation sequencing; nr: non-redundant; RB: reciprocal
771 BLAST; rRNA: ribosomal RNA; RNA-Seq: RNA sequencing; SNP: single nucleotide
772 polymorphism; SSR: simple sequence repeats
773
774 COMPETING INTEREST
775 The authors declare no competing interests.
776
777 AUTHORS’ CONTRIBUTIONS
778 EM, SK, and AP conceived the investigation. SK, CC and AP performed library
779 preparation and sequencing. CC, AP, and EM assembled and annotated the
780 transcriptomes. SK performed computational analyses related to transcriptome
781 completeness, GO annotation, and phylogenetics. CC compiled transcriptome statistics.
782 CC and SK performed targeted gene searches. EM performed cross contamination
783 screens, identified SSR markers and provided bioinformatic expertise. SK, CC and EM
784 made significant contributions to the preparation of the manuscript. All authors revised
785 and approved the final manuscript.
786
787 ACKNOWLEDGEMENTS
788 Research funding was provided by Oregon State University, Department of Integrative
789 Biology. Publication of this article in an open access journal was funded by the Oregon
790 State University Libraries & Press Open Access Fund. We would like to acknowledge
791 Dr. Christine Schnitzler, and the labs of Dr. Tung-Yung Fan and Dr. Andrew Baker for
792 assistance with sample collection. In addition, we would like to thank Sarah Guermond
793 and Emily Weiss for assistance in sample preparation and analysis.
794
795 References
796
797 Abascal, F., R. Zardoya, and D. Posada, 2005 ProtTest: selection of best-fit models of
798 protein evolution. Bioinformatics 21 (9): 2104-2105.
799 Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, 1990 Basic local
800 alignment search tool. J. Mol. Biol. 215 (3): 403-410.
801 Barshis, D. J., J. T. Ladner, T. A. Oliver, F. O. Seneca, N. Traylor-Knowles et al., 2013
802 Genomic basis for coral resilience to climate change. Proc. Natl. Acad. Sci. U. S.
803 A. 110 (4): 1387-1392.
804 Berntson, E. A., S. C. France, and L. S. Mullineaux, 1999 Phylogenetic relationships
805 within the class Anthozoa (phylum Cnidaria) based on nuclear 18S rDNA
806 sequences. Mol. Phylogenet. Evol. 13 (2): 417-433.
807 Bongaerts, P., C. Riginos, T. Ridgway, E. M. Sampayo, M. J. H. van Oppen et al., 2010
808 Genetic Divergence across Habitats in the Widespread Coral Seriatopora hystrix
809 and Its Associated Symbiodinium. PLoS One 5 (5): e10871.
810 Bridge, D., C. W. Cunningham, B. Schierwater, R. DeSalle, and L. W. Buss, 1992 Class-
811 level relationships in the phylum Cnidaria: evidence from mitochondrial genome
812 structure. Proc. Natl. Acad. Sci. U. S. A. 89 (18): 8750-8753.
813 Budd, A. F., and J. Stolarski, 2011 Corallite wall and septal microstructure in
814 scleractinian reef corals: comparison of molecular clades within the family
815 Faviidae. J. Morphol. 272 (1): 66-88.
816 Castresana, J., 2000 Selection of conserved blocks from multiple alignments for their use
817 in phylogenetic analysis. Mol. Biol. Evol. 17 (4): 540-552.
818 Chapman, J. A., E. F. Kirkness, O. Simakov, S. E. Hampson, T. Mitros et al., 2010 The
819 dynamic genome of Hydra. Nature 464 (7288): 592-596.
820 Closek, C. J., S. Sunagawa, M. K. DeSalvo, Y. M. Piceno, T. Z. DeSantis et al., 2014
821 Coral transcriptome and bacterial community profiles reveal distinct Yellow Band
822 Disease states in Orbicella faveolata. ISME J. 8: 2411-2422.
823 Conaco, C., P. Neveu, H. Zhou, M. L. Arcila, S. M. Degnan et al., 2012 Transcriptome
824 profiling of the demosponge Amphimedon queenslandica reveals genome-wide
825 events that accompany major life cycle transitions. BMC Genomics 13 (1): 209.
826 Concepcion, G., N. Polato, I. Baums, and R. Toonen, 2010 Development of microsatellite
827 markers from four Hawaiian corals: Acropora cytherea, Fungia scutaria,
828 Montipora capitata and Porites lobata. Conserv. Genet. Resour. 2 (1): 11-15.
829 Daly, M., D. G. Fautin, and V. A. Cappola, 2003 Systematics of the Hexacorallia
830 (Cnidaria: Anthozoa). Zool. J. Linn. Soc. 139 (3): 419-437.
831 Davies, S. W., M. Rahman, E. Meyer, E. A. Green, E. Buschiazzo et al., 2013 Novel
832 polymorphic microsatellite markers for population genetics of the endangered
833 Caribbean star coral, Montastraea faveolata. Mar. Biodivers. 43 (2): 167-172.
834 Davy, S. K., D. Allemand, and V. M. Weis, 2012 Cell biology of cnidarian-dinoflagellate
835 symbiosis. Microbiol. Mol. Biol. Rev. 76 (2): 229-261.
836 DeSalvo, M., C. Voolstra, S. Sunagawa, J. Schwarz, J. Stillman et al., 2008 Differential
837 gene expression during thermal stress and bleaching in the Caribbean coral
838 Montastraea faveolata. Mol. Ecol. 17 (17): 3952-3971.
839 Douglas, A. E., 2003 Coral bleaching - how and why? Mar. Pollut. Bull. 46 (4): 385-392.
840 Dunn, C. W., A. Hejnol, D. Q. Matus, K. Pang, W. E. Browne et al., 2008 Broad
841 phylogenomic sampling improves resolution of the animal tree of life. Nature 452
842 (7188): 745-749.
843 Fernandez-Silva, I., J. Whitney, B. Wainwright, K. R. Andrews, H. Ylitalo-Ward et al.,
844 2013 Microsatellites for next-generation ecologists: a post-sequencing
845 bioinformatics pipeline. PLoS One 8 (2): e55990.
846 Fuchs, B., W. Wang, S. Graspeuntner, Y. Li, S. Insua et al., 2014 Regulation of polyp-to-
847 jellyfish transition in Aurelia aurita. Curr. Biol. 24 (3): 263-273.
848 Fukami, H., C. A. Chen, A. F. Budd, A. Collins, C. Wallace et al., 2008 Mitochondrial
849 and nuclear genes suggest that stony corals are monophyletic but most families of
850 stony corals are not (Order Scleractinia, Class Anthozoa, Phylum Cnidaria). PLoS
851 One 3 (9): e3222.
852 Galtier, N., R. W. Jobson, B. Nabholz, S. Glémin, and P. U. Blier, 2009 Mitochondrial
853 whims: metabolic rate, longevity and the rate of molecular evolution. Biol. Lett. 5
854 (3): 413-416.
855 Grabherr, M. G., B. J. Haas, M. Yassour, J. Z. Levin, D. A. Thompson et al., 2011
856 Trinity: reconstructing a full-length transcriptome without a genome from RNA-
857 Seq data. Nat. Biotechnol. 29 (7): 644-652.
858 Graveley, B. R., A. N. Brooks, J. W. Carlson, M. O. Duff, J. M. Landolin et al., 2011 The
859 developmental transcriptome of Drosophila melanogaster. Nature 471 (7339):
860 473-479.
861 Haas, B. J., A. Papanicolaou, M. Yassour, M. Grabherr, P. D. Blood et al., 2013 De novo
862 transcript sequence reconstruction from RNA-seq using the Trinity platform for
863 reference generation and analysis. Nat. Protoc. 8 (8): 1494-1512.
864 Hamada, M., E. Shoguchi, C. Shinzato, T. Kawashima, D. J. Miller et al., 2012 The
865 complex NOD-like receptor repertoire of the coral Acropora digitifera includes
866 novel domain combinations. Mol. Biol. Evol.: mss213.
867 Havird, J. C., and S. R. Santos, 2014 Performance of single and concatenated sets of
868 mitochondrial genes at inferring metazoan relationships relative to full
869 mitogenome data. PLoS One 9 (1): e84080.
870 Helm, R. R., S. Siebert, S. Tulin, J. Smith, and C. W. Dunn, 2013 Characterization of
871 differential transcript abundance through time during Nematostella vectensis
872 development. BMC Genomics 14 (1): 266.
873 Huang, X., and A. Madan, 1999 CAP3: a DNA sequence assembly program. Genome
874 Res. 9 (9): 868-877.
875 Illumina, 2014 Illumina Customer Sequence Letter. Illumina, Inc., San Diego.
876 Karako-Lampert, S., D. Zoccola, M. Salmon-Divon, M. Katzenellenbogen, S. Tambutté
877 et al., 2014 Transcriptome analysis of the scleractinian coral Stylophora pistillata.
878 PLoS One 9 (2): e88615.
879 Katoh, K., K. Misawa, K. i. Kuma, and T. Miyata, 2002 MAFFT: a novel method for
880 rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids
881 Res. 30 (14): 3059-3066.
882 Kayal, E., B. Roure, H. Philippe, A. G. Collins, and D. V. Lavrov, 2013 Cnidarian
883 phylogenetic relationships as revealed by mitogenomics. BMC Evol. Biol. 13 (1):
884 5.
885 Kenkel, C., E. Meyer, and M. Matz, 2013 Gene expression under chronic heat stress in
886 populations of the mustard hill coral (Porites astreoides) from different thermal
887 environments. Mol. Ecol. 22 (16): 4322-4334.
888 Kininmonth, S., M. J. H. van Oppen, and H. P. Possingham, 2010 Determining the
889 community structure of the coral Seriatopora hystrix from hydrodynamic and
890 genetic networks. Ecol. Modell. 221 (24): 2870-2880.
891 Kitahara, M. V., S. D. Cairns, J. Stolarski, D. Blair, and D. J. Miller, 2010 A
892 comprehensive phylogenetic analysis of the Scleractinia (Cnidaria, Anthozoa)
893 based on mitochondrial CO1 sequence data. PLoS One 5 (7): e11490.
894 Kocot, K. M., J. T. Cannon, C. Todt, M. R. Citarella, A. B. Kohn et al., 2011
895 Phylogenomics reveals deep molluscan relationships. Nature 477 (7365): 452-
896 456.
897 Kocot, K. M., M. R. Citarella, L. L. Moroz, and K. M. Halanych, 2013 PhyloTreePruner:
898 a phylogenetic tree-based approach for selection of orthologous sequences for
899 phylogenomics. Evol. Bioinform. Online 9: 429-435.
900 Kück, P., and K. Meusemann, 2010 FASconCAT: Convenient handling of data matrices.
901 Mol. Phylogenet. Evol. 56 (3): 1115-1118.
902 Lehnert, E. M., M. S. Burriesci, and J. R. Pringle, 2012 Developing the anemone Aiptasia
903 as a tractable model for cnidarian-dinoflagellate symbiosis: the transcriptome of
904 aposymbiotic A. pallida. BMC Genomics 13 (1): 271.
905 Lehnert, E. M., M. E. Mouchka, M. S. Burriesci, N. D. Gallo, J. A. Schwarz et al., 2014
906 Extensive differences in gene expression between symbiotic and aposymbiotic
907 cnidarians. G3: Genes|Genomes|Genetics 4 (2): 277-295.
908 Li, L., C. J. Stoeckert, and D. S. Roos, 2003 OrthoMCL: Identification of ortholog groups
909 for eukaryotic genomes. Genome Res. 13 (9): 2178-2189.
910 Maddison, W. P., and D. R. Maddison, 2011 Mesquite: a modular system for
911 evolutionary analysis.
912 Maier, E., R. Tollrian, and B. Nürnberger, 2001 Development of species-specific markers
913 in an organism with endosymbionts: microsatellites in the scleractinian coral
914 Seriatopora hystrix. Mol. Ecol. Notes 1 (3): 157-159.
915 Marlow, H. Q., M. Srivastava, D. Q. Matus, D. Rokhsar, and M. Q. Martindale, 2009
916 Anatomy and development of the nervous system of Nematostella vectensis, an
917 anthozoan cnidarian. Dev. Neurobiol. 69 (4): 235-254.
918 Mazel, C. H., M. P. Lesser, M. Y. Gorbunov, T. M. Barry, J. H. Farrell et al., 2003
919 Green-fluorescent proteins in Caribbean corals. Limnol. Oceanogr. 48 (1): 402-
920 411.
921 Meyer, E., Meyer Laboratory Website. http://people.oregonstate.edu/~meyere/index.html
922 Meyer, E., G. V. Aglyamova, and M. V. Matz, 2011 Profiling gene expression responses
923 of coral larvae (Acropora millepora) to elevated temperature and settlement
924 inducers using a novel RNA-Seq procedure. Mol. Ecol. 20 (17): 3599-3616.
925 Meyer, E., G. V. Aglyamova, S. Wang, J. Buchanan-Carter, D. Abrego et al., 2009
926 Sequencing and de novo analysis of a coral larval transcriptome using 454 GSFlx.
927 BMC Genomics 10 (1): 219.
928 Meyer, E., and V. M. Weis, 2012 Study of cnidarian-algal symbiosis in the “omics” age.
929 Biol. Bull. 223 (1): 44-65.
930 Moriya, Y., M. Itoh, S. Okuda, A. C. Yoshizawa, and M. Kanehisa, 2007 KAAS: an
931 automatic genome annotation and pathway reconstruction server. Nucleic Acids
932 Res. 35 (suppl 2): W182-W185.
933 Moya, A., L. Huisman, E. Ball, D. Hayward, L. Grasso et al., 2012 Whole transcriptome
934 analysis of the coral Acropora millepora reveals complex responses to
935 CO2‐driven acidification during the initiation of calcification. Mol. Ecol. 21 (10):
936 2440-2454.
937 Nakasugi, K., R. N. Crowhurst, J. Bally, C. C. Wood, R. P. Hellens et al., 2013 De Novo
938 transcriptome sequence assembly and analysis of RNA silencing genes of
939 Nicotiana benthamiana. PLoS One 8 (3): e59534.
940 O'Neil, S., J. Dzurisin, R. Carmichael, N. Lobo, S. Emrich et al., 2010 Population-level
941 transcriptome sequencing of nonmodel organisms Erynnis propertius and Papilio
942 zelicaon. BMC Genomics 11 (1): 310.
943 O'Neil, S., and S. Emrich, 2013 Assessing De Novo transcriptome assembly metrics for
944 consistency and utility. BMC Genomics 14 (1): 465.
945 Parra, G., K. Bradnam, and I. Korf, 2007 CEGMA: a pipeline to accurately annotate core
946 genes in eukaryotic genomes. Bioinformatics 23 (9): 1061-1067.
947 Philippe, H., E. A. Snell, E. Bapteste, P. Lopez, P. W. Holland et al., 2004
948 Phylogenomics of eukaryotes: impact of missing data on large alignments. Mol.
949 Biol. Evol. 21 (9): 1740-1752.
950 Polato, N. R., J. C. Vera, and I. B. Baums, 2011 Gene discovery in the threatened elkhorn
951 coral: 454 sequencing of the Acropora palmata transcriptome. PLoS One 6 (12):
952 e28634.
953 Poole, A. Z., and V. M. Weis, 2014 TIR-domain-containing protein repertoire of nine
954 anthozoan species reveals coral–specific expansions and uncharacterized proteins.
955 Dev. Comp. Immunol. 46 (2): 480-488.
956 Pratlong, M., A. Haguenauer, O. Chabrol, C. Klopp, P. Pontarotti et al., 2015 The red
957 coral (Corallium rubrum) transcriptome: a new resource for population genetics
958 and local adaptation studies. Mol. Ecol. Resour. 15: 1205-1215.
959 Price, M. N., P. S. Dehal, and A. P. Arkin, 2010 FastTree 2 – approximately Maximum-
960 Likelihood trees for large alignments. PLoS One 5 (3): e9490.
961 Putnam, N. H., M. Srivastava, U. Hellsten, B. Dirks, J. Chapman et al., 2007 Sea
962 anemone genome reveals ancestral eumetazoan gene repertoire and genomic
963 organization. Science 317 (5834): 86-94.
964 Quast, C., E. Pruesse, P. Yilmaz, J. Gerken, T. Schweer et al., 2012 The SILVA
965 ribosomal RNA gene database project: improved data processing and web-based
966 tools. Nucleic Acids Res.: gks1219.
967 Reynolds, W. S., J. A. Schwarz, and V. M. Weis, 2000 Symbiosis-enhanced gene
968 expression in cnidarian-algal associations: cloning and characterization of a
969 cDNA, sym32, encoding a possible cell adhesion protein. Comp. Biochem. Physiol. A 126
970 (1): 33-44.
971 Riesgo, A., S. C. Andrade, P. Sharma, M. Novo, A. Perez-Porro et al., 2012 Comparative
972 description of ten transcriptomes of newly sequenced invertebrates and efficiency
973 estimation of genomic sampling in non-model taxa. Front. Zool. 9 (1): 33.
974 Riesgo, A., N. Farrar, P. J. Windsor, G. Giribet, and S. P. Leys, 2014 The analysis of
975 eight transcriptomes from all poriferan classes reveals surprising genetic
976 complexity in sponges. Mol. Biol. Evol. 31 (5): 1102-1120.
977 Romano, S. L., and S. R. Palumbi, 1996 Evolution of scleractinian corals inferred from
978 molecular systematics. Science 271 (5249): 640-642.
979 Rota-Stabelli, O., Z. Yang, and M. J. Telford, 2009 MtZoa: A general mitochondrial
980 amino acid substitutions model for animal evolutionary studies. Mol. Phylogenet.
981 Evol. 52 (1): 268-272.
982 Roure, B., D. Baurain, and H. Philippe, 2013 Impact of missing data on phylogenies
983 inferred from empirical phylogenomic data sets. Mol. Biol. Evol. 30 (1): 197-214.
984 Rozen, S., and H. Skaletsky, 1999 Primer3 on the WWW for general users and for
985 biologist programmers, pp. 365-386 in Bioinformatics methods and protocols.
986 Springer.
987 Ruiz-Ramos, D., and I. Baums, 2014 Microsatellite abundance across the Anthozoa and
988 Hydrozoa in the phylum Cnidaria. BMC Genomics 15 (1): 939.
989 Ryan, J. F., P. M. Burton, M. E. Mazza, G. K. Kwong, J. C. Mullikin et al., 2006 The
990 cnidarian-bilaterian ancestor possessed at least 56 homeoboxes: evidence from the
991 starlet sea anemone, Nematostella vectensis. Genome Biol. 7 (7): R64.
992 Ryan, J. F., K. Pang, C. E. Schnitzler, A.-D. Nguyen, R. T. Moreland et al., 2013 The
993 genome of the ctenophore Mnemiopsis leidyi and its implications for cell type
994 evolution. Science 342 (6164).
995 Sanders, J. G., and S. R. Palumbi, 2011 Populations of Symbiodinium muscatinei show
996 strong biogeographic structuring in the intertidal anemone Anthopleura
997 elegantissima. Biol. Bull. 220 (3): 199-208.
998 Sanders, S., M. Shcheglovitova, and P. Cartwright, 2014 Differential gene expression
999 between functionally specialized polyps of the colonial hydrozoan Hydractinia
1000 symbiolongicarpus (Phylum Cnidaria). BMC Genomics 15 (1): 406.
1001 Schnitzler, C. E., and V. M. Weis, 2010 Coral larvae exhibit few measurable
1002 transcriptional changes during the onset of coral-dinoflagellate endosymbiosis.
1003 Mar. Genomics 3 (2): 107-116.
1004 Selkoe, K. A., and R. J. Toonen, 2006 Microsatellites for ecologists: a practical guide to
1005 using and evaluating microsatellite markers. Ecol. Lett. 9 (5): 615-629.
1006 Serrano, X., I. B. Baums, K. O'Reilly, T. B. Smith, R. J. Jones et al., 2014 Geographic
1007 differences in vertical connectivity in the Caribbean coral Montastraea cavernosa
1008 despite high levels of horizontal connectivity at shallow depths. Mol. Ecol. 23
1009 (17): 4226-4240.
1010 Shearer, T. L., C. Gutiérrez-Rodríguez, and M. A. Coffroth, 2005 Generating molecular
1011 markers from zooxanthellate cnidarians. Coral Reefs 24 (1): 57-66.
1012 Shinzato, C., M. Hamada, E. Shoguchi, T. Kawashima, and N. Satoh, 2012a The
1013 repertoire of chemical defense genes in the coral Acropora digitifera genome.
1014 Zoolog. Sci. 29 (8): 510-517.
1015 Shinzato, C., M. Inoue, and M. Kusakabe, 2014 A snapshot of a coral “holobiont”: a
1016 transcriptome assembly of the scleractinian coral, Porites, captures a wide variety
1017 of genes from both the host and symbiotic zooxanthellae. PLoS One 9 (1):
1018 e85182.
1019 Shinzato, C., E. Shoguchi, T. Kawashima, M. Hamada, K. Hisata et al., 2011 Using the
1020 Acropora digitifera genome to understand coral responses to environmental
1021 change. Nature 476 (7360): 320-323.
1022 Shinzato, C., E. Shoguchi, M. Tanaka, and N. Satoh, 2012b Fluorescent protein candidate
1023 genes in the coral Acropora digitifera genome. Zoolog. Sci. 29 (4): 260-264.
1024 Siebert, S., M. D. Robinson, S. C. Tintori, F. Goetz, R. R. Helm et al., 2011 Differential
1025 gene expression in the siphonophore Nanomia bijuga (Cnidaria) assessed with
1026 multiple next-generation sequencing workflows. PLoS One 6 (7): e22953.
1027 Smit, A., R. Hubley, and P. Green, RepeatMasker Open-3.0.
1028 http://www.repeatmasker.org
1029 Smith-Keune, C., and S. Dove, 2008 Gene expression of a green fluorescent protein
1030 homolog as a host-specific biomarker of heat stress within a reef-building coral.
1031 Mar. Biotechnol. (N. Y.) 10 (2): 166-180.
1032 Soza‐Ried, J., A. Hotz‐Wagenblatt, K. H. Glatting, C. del Val, K. Fellenberg et al., 2010
1033 The transcriptome of the colonial marine hydroid Hydractinia echinata. FEBS
1034 Journal 277 (1): 197-209.
1035 Srivastava, M., O. Simakov, J. Chapman, B. Fahey, M. E. A. Gauthier et al., 2010 The
1036 Amphimedon queenslandica genome and the evolution of animal complexity.
1037 Nature 466 (7307): 720-726.
1038 Stamatakis, A., 2014 RAxML Version 8: A tool for phylogenetic analysis and post-
1039 analysis of large phylogenies. Bioinformatics 30 (9): 1312-1313.
1040 Stefanik, D. J., T. J. Lubinski, B. R. Granger, A. L. Byrd, A. M. Reitzel et al., 2014
1041 Production of a reference transcriptome and transcriptomic database
1042 (EdwardsiellaBase) for the lined sea anemone, Edwardsiella lineata, a parasitic
1043 cnidarian. BMC Genomics 15 (1): 71.
1044 Sun, J., Q. Chen, J. C. Lun, J. Xu, and J.-W. Qiu, 2013 PcarnBase: Development of a
1045 transcriptomic database for the brain coral Platygyra carnosus. Mar. Biotechnol.
1046 (N. Y.) 15 (2): 244-251.
1047 The Gene Ontology Consortium, M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein et al., 2000
1048 Gene Ontology: tool for the unification of biology. Nat. Genet. 25 (1): 25-29.
1049 Traylor-Knowles, N., B. R. Granger, T. J. Lubinski, J. R. Parikh, S. Garamszegi et al.,
1050 2011 Production of a reference transcriptome and transcriptomic database
1051 (PocilloporaBase) for the cauliflower coral, Pocillopora damicornis. BMC
1052 Genomics 12 (1): 585.
1053 Underwood, J. N., L. D. Smith, M. J. H. Van Oppen, and J. P. Gilmour, 2007 Multiple
1054 scales of genetic connectivity in a brooding coral on isolated reefs following
1055 catastrophic bleaching. Mol. Ecol. 16 (4): 771-784.
1056 Underwood, J. N., P. B. Souter, E. R. Ballment, A. H. Lutz, and M. J. H. Van Oppen,
1057 2006 Development of 10 polymorphic microsatellite markers from herbicide-
1058 bleached tissues of the brooding pocilloporid coral Seriatopora hystrix. Mol. Ecol.
1059 Notes 6 (1): 176-178.
1060 Van Dongen, S., 2000 Graph clustering by flow simulation. University of Utrecht, The
1061 Netherlands.
1062 van Oppen, M. J. H., J. Catmull, B. J. McDonald, N. R. Hislop, P. J. Hagerman et al.,
1063 2002 The mitochondrial genome of Acropora tenuis (Cnidaria; Scleractinia)
1064 contains a large group I intron and a candidate control region. J. Mol. Evol. 55
1065 (1): 1-13.
1066 van Oppen, M. J. H., A. Lutz, G. De'ath, L. Peplow, and S. Kininmonth, 2008 Genetic
1067 traces of recent long-distance dispersal in a predominantly self-recruiting coral.
1068 PLoS One 3 (10): e3401.
1069 Vidal-Dupiol, J., D. Zoccola, E. Tambutté, C. Grunau, C. Cosseau et al., 2013 Genes
1070 related to ion-transport and energy production are upregulated in response to
1071 CO2-driven pH decrease in corals: new insights from transcriptome analysis.
1072 PLoS One 8 (3): e58652.
1073 Weis, V. M., 2008 Cellular mechanisms of cnidarian bleaching: stress causes the
1074 collapse of symbiosis. J. Exp. Biol. 211 (19): 3059-3066.
1075 Weis, V. M., and D. Allemand, 2009 What determines coral health? Science 324 (5931):
1076 1153-1155.
1077 Wenger, Y., and B. Galliot, 2013 RNAseq versus genome-predicted transcriptomes: a
1078 large population of novel transcripts identified in an Illumina-454 Hydra
1079 transcriptome. BMC Genomics 14 (1): 204.
1080 Whelan, S., and N. Goldman, 2001 A general empirical model of protein evolution
1081 derived from multiple protein families using a Maximum-Likelihood approach.
1082 Mol. Biol. Evol. 18 (5): 691-699.
1083 Wood-Charlson, E. M., and V. M. Weis, 2009 The diversity of C-type lectins in the
1084 genome of a basal metazoan, Nematostella vectensis. Dev. Comp. Immunol. 33
1085 (8): 881-889.
1086
1087 FIGURE LEGENDS
1088 Figure 1. Annotation pipeline used to classify origins of each assembled transcript.
1089 A series of sequence comparisons was performed, comparing each transcript against N.
1090 vectensis rRNA [SILVA: ABAV01023297, ABAV01023333], A. tenuis
1091 mitochondrial DNA [NCBI: NC_003522.1], A. digitifera and S. minutum gene models,
1092 and the NCBI non-redundant protein database (bit-score threshold of 45 for small
1093 databases; e-value threshold of 10^-5 for large databases). Transcripts were assigned to
1094 categories by evaluating their similarity to each database in the order shown (see
1095 Methods for details).
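
The ordered screen described in this legend can be illustrated with a minimal Python sketch. It is not the authors' script. The database keys, the category labels, the relative order of the A. digitifera and S. minutum checks, and the function name classify_transcript are assumptions made for illustration; only the two thresholds (bit-score of 45 for small databases, e-value of 10^-5 for the large database) come from the legend.

```python
from typing import Dict, Optional, Tuple

BITSCORE_MIN = 45.0  # small, local databases (value from the legend)
EVALUE_MAX = 1e-5    # large NCBI nr database (value from the legend)

# Hypothetical database keys and category labels; the order of the two
# gene-model checks is an assumption, not taken from the paper.
SMALL_DB_ORDER = [
    ("rRNA", "rRNA"),
    ("mtDNA", "mtDNA"),
    ("A_digitifera", "metazoan"),
    ("S_minutum", "dinoflagellate"),
]


def classify_transcript(
    small_hits: Dict[str, float],
    nr_hit: Optional[Tuple[float, str]] = None,
) -> str:
    """Assign one transcript to a category.

    small_hits maps a database key to the best bit score found there.
    nr_hit is (e-value, taxonomic label of the best nr match) or None.
    """
    # Walk the small databases in order; the first passing match wins.
    for db_key, category in SMALL_DB_ORDER:
        if small_hits.get(db_key, 0.0) >= BITSCORE_MIN:
            return category
    # Fall back to the large database, which is filtered by e-value.
    if nr_hit is not None:
        evalue, category = nr_hit
        if evalue <= EVALUE_MAX:
            return category
    return "no match"
```

For example, a transcript with a bit score of 30 against rRNA and 120 against the A. digitifera gene models would skip the rRNA category and be labelled "metazoan" under these assumptions.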
1096
1097 Figure 2. Three metrics used to evaluate gene representation and assembly of
1098 complete transcripts in de novo transcriptome assemblies. (a) Percent of core
1099 eukaryotic genes (CEGMA) identified in each assembly; (b) percent of A. digitifera gene
1100 models with significant matches in each assembly; (c) median proportion of each N.
1101 vectensis protein aligned with transcripts in each assembly (OHRhits). Grey = our
1102 transcriptome assembly compared to the respective reference for each analysis.
1103
1104 Figure 3. Distribution of functional categories (GO terms) in each transcriptome
1105 assembly. The percentage of transcripts with GO annotation for each category under the
1106 three main ontology domains was calculated for each assembly.
1107
1108 Figure 4. Predicted taxonomic origin of transcriptomes based on homology searches
1109 with BLAST. The percentages of transcripts assigned to rRNA (purple), mtDNA
1110 (blue), dinoflagellate (green), metazoan (pink), other taxa (orange) and no match (grey)
1111 are shown.
1112
1113 Figure 5. Discordance in maximum likelihood phylogenetic reconstruction of COI
1114 compared to a combined phylogeny of concatenated ND (2, 4 and 5) genes and two
1115 phylogenomic trees. The COI phylogeny is presented on the left and the combined
1116 phylogeny is presented on the right. Topologies for the ND mitochondrial set and the relaxed and
1117 conservative phylogenomic trees were nearly identical. Therefore, nodal support is
1118 summarized on the relaxed tree (right). Bootstrap support at the nodes from left to right
1119 represents ND gene set/relaxed/conservative. If topologies differed in the summary tree,
1120 the nodal support is presented as -- next to the node. Yellow solid lines connect taxa
1121 with different positions and/or relationships between the two trees, while black dashed
1122 lines connect those with the same position and/or relationship. Groups in the class
1123 Anthozoa, following Kitahara et al. (2010), are highlighted in
1124 boxes: teal = robust corals, pink = complex corals, and light blue = anemones. The names
1125 of species used in this study are emphasized by bold font. Scale bars indicate the amino
1126 acid replacements per site.
1127
1128 Figure S1 in File S1.doc. Venn diagram of shared orthologous groups. Comparison
1129 of the orthologous groups identified with FastOrtho from the four transcriptomes in this
1130 study. Total orthologous groups for each transcriptome are in parenthetical notation under
1131 the species name. S. hystrix and M. cavernosa shared the most orthologs (3,900) followed
1132 by F. scutaria and M. cavernosa (1,682).
1133
1134 Figure S2 in File S1.doc. Individual maximum likelihood trees from COI,
1135 concatenated ND genes, relaxed and conservative taxon sampling across the whole
1136 transcriptomes and genomes. The optimal COI (A), ND genes (B), relaxed (C) and
1137 conservative (D) phylogenies are presented with nodal support from 500 bootstrap
1138 replicates, except for the relaxed tree, which used 100 bootstrap replicates. The four transcriptomes
1139 from this study are highlighted by bold font. The scale bar beneath each tree indicates the
1140 amino acid substitutions per site.
1141
1142 TABLES
1143 Table 1. Collection sites, life history stages and symbiotic states of the four anthozoans
1144 used for transcriptome assembly.
Organism                   Collection Site      Developmental Stage  Symbiotic State
Anthopleura elegantissima  Seal Rock, OR        Adult                Aposymbiotic
Fungia scutaria            Coconut Island, HI   Larval               Aposymbiotic
Montastraea cavernosa      Florida Keys, FL     Adult                Symbiotic
Seriatopora hystrix        Nanwan Bay, Taiwan   Adult                Symbiotic
1145
1146 Table S1 in File S1.doc. Oligonucleotide primers used in sample preparation for
1147 Illumina sequencing.
1148
1149 Table S2 in File S1.doc. Genomic and transcriptomic datasets used for ortholog
1150 identification and phylogenetic analyses.
1151
1152 Table S3 in File S1.doc. Cytochrome oxidase subunit I (COI) sequences used in the
1153 phylogenetic analysis.
1154
1155 Table S4 in File S1.doc. Supergene set of NADH dehydrogenase transcripts used in the
1156 phylogenetic analysis.
1157
1158 Table S5 in File S1.doc. General transcriptome assembly and annotation statistics before
1159 and after a minimum transcript length was set to 400 bp.
1160
1161 Table S6 in File S2.xls. Compiled annotation for A. elegantissima transcriptome
1162 including transcript ID, UniProt, GO and KEGG annotation, and ribosomal RNA,
1163 mitochondrial DNA or taxa origin from local and NCBI database searches.
1164
1165 Table S7 in File S3.xls. Compiled annotation for F. scutaria transcriptome including
1166 transcript ID, UniProt, GO and KEGG annotation, and ribosomal RNA, mitochondrial
1167 DNA or taxa origin from local and NCBI database searches.
1168
1169 Table S8 in File S4.xls. Compiled annotation for M. cavernosa transcriptome including
1170 transcript ID, UniProt, GO and KEGG annotation, and ribosomal RNA, mitochondrial
1171 DNA or taxa origin from local and NCBI database searches.
1172
1173 Table S9 in File S5.xls. Compiled annotation for S. hystrix transcriptome including
1174 transcript ID, UniProt, GO and KEGG annotation, and ribosomal RNA, mitochondrial
1175 DNA or taxa origin from local and NCBI database searches.
1176
1177 Table S10 in File S1.doc. Comparison of gene searches by name search and reciprocal
1178 BLAST. Bit-score cutoffs were set to 45 and taxonomic annotations were designated
1179 based on our taxonomic screen (Figure 1).
1180
1181 Table S11 in File S6.xls. Primers designed for potential SSR markers from each species
1182 in this study.
1183
1184 Table S12 in File S7.xls. Orthologs used in relaxed (≥ 10 taxa) and conservative (≥ 14
1185 taxa) phylogenomic analyses.
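
The two taxon-occupancy cutoffs named in this caption amount to a simple filter on how many taxa are represented in each orthologous group. The sketch below is a hypothetical illustration only: the function name, the dictionary layout and the group identifiers are assumptions, and only the cutoffs of 10 and 14 taxa come from the caption.

```python
from typing import Dict, List, Set

RELAXED_MIN_TAXA = 10       # cutoff from the caption
CONSERVATIVE_MIN_TAXA = 14  # cutoff from the caption


def select_orthologs(groups: Dict[str, Set[str]], min_taxa: int) -> List[str]:
    """Return the IDs of ortholog groups represented in at least min_taxa taxa.

    groups maps a (hypothetical) ortholog group ID to the set of taxa
    that contribute a sequence to that group.
    """
    return sorted(
        group_id for group_id, taxa in groups.items() if len(taxa) >= min_taxa
    )


# Toy input: three groups with 15, 11 and 5 taxa respectively.
example_groups = {
    "OG1": {f"taxon{i}" for i in range(15)},
    "OG2": {f"taxon{i}" for i in range(11)},
    "OG3": {f"taxon{i}" for i in range(5)},
}
relaxed = select_orthologs(example_groups, RELAXED_MIN_TAXA)
conservative = select_orthologs(example_groups, CONSERVATIVE_MIN_TAXA)
```

With this toy input the relaxed set keeps OG1 and OG2, while the conservative set keeps only OG1, which shows the trade-off between the number of genes retained and the amount of missing data per gene.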
1186
1187 Figure 1.
1188
1189 Figure 2.
1190
1191
1192 Figure 3.
1193
1194
1195 Figure 4.
1196
1197
1198 Figure 5.
1199