ACCEPTED MANUSCRIPT

Sequencing and De Novo assembly of visceral mass transcriptome of the Critically Endangered land snail Satsuma myomphala: Annotation and SSR Discovery

Se Won Kang, Bharat Bhusan Patnaik, Hee-Ju Hwang, So Young Park, Jong Min Chung, Dae Kwon Song, Hongray Howrelia Patnaik, Jae Bong Lee, Soonok Kim, Hong Seog Park, Seung-Hwan Park, Yeon Soo Han, Jun Sang Lee, Yong Seok Lee

PII: S1744-117X(16)30080-6 DOI: doi:10.1016/j.cbd.2016.10.004 Reference: CBD 432

To appear in: Comparative Biochemistry and Physiology - Part D: Genomics and Proteomics

Received date: 24 January 2016 Revised date: 24 October 2016 Accepted date: 26 October 2016

Please cite this article as: Kang, Se Won, Patnaik, Bharat Bhusan, Hwang, Hee-Ju, Park, So Young, Chung, Jong Min, Song, Dae Kwon, Patnaik, Hongray Howrelia, Lee, Jae Bong, Kim, Soonok, Park, Hong Seog, Park, Seung-Hwan, Han, Yeon Soo, Lee, Jun Sang, Lee, Yong Seok, Sequencing and De Novo assembly of visceral mass transcriptome of the Critically Endangered land snail Satsuma myomphala: Annotation and SSR Discovery, Comparative Biochemistry and Physiology - Part D: Genomics and Proteomics (2016), doi:10.1016/j.cbd.2016.10.004

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Sequencing and De Novo assembly of visceral mass transcriptome of the Critically Endangered land snail Satsuma myomphala: Annotation and SSR Discovery

Se Won Kang 1†, Bharat Bhusan Patnaik 1,2,†, Hee-Ju Hwang 1, So Young Park 3, Jong Min Chung 1, Dae Kwon Song 1, Hongray Howrelia Patnaik 1, Jae Bong Lee 4, Changmu Kim 5, Soonok Kim 5, Hong Seog Park 6, Seung-Hwan Park 7, Yeon Soo Han 8, Jun Sang Lee 9, Yong Seok Lee 1,*

1 Department of Life Science and Biotechnology, College of Natural Sciences, Soonchunhyang University, 22 Soonchunhyangro, Shinchang-myeon, Asan, Chungcheongnam-do 31538, Korea.
2 Trident School of Biotech Sciences, Trident Academy of Creative Technology (TACT), Chandaka Industrial Estate, Chandrasekharpur, Bhubaneswar, Odisha 751024, India.
3 Nakdonggang National Institute of Biological Resources, Biodiversity Conservation and Climate Change Division, 137, Donam-2-gil, Sangju-si, Gyeongsangbuk-do, 37242, Korea.
4 Korea Zoonosis Research Institute (KOZRI), Chonbuk National University, 820-120 Hana-ro, Iksan, Jeollabuk-do 54528, Korea.
5 National Institute of Biological Resources, 42, Hwangyeong-ro, Seo-gu, Incheon 22689, Korea.
6 Research Institute, GnC BIO Co., LTD. 621-6 Banseok-dong, Yuseong-gu, Daejeon 34069, Korea.
7 Biological Resource Centre, Korean Research Institute of Bioscience and Biotechnology (KRIBB), Jeongeup, Korea.
8 College of Agriculture and Life Science, Chonnam National University, 77 Yongbong-ro, Buk-gu, Gwangju 61186, Korea.
9 Institute of Environmental Research, Kangwon National University, 1 Kangwondaehak-gil, Chuncheon-si, Gangwon-do 243341, Korea.

† These authors contributed equally to this work.

Running title: Transcriptome of Critically Endangered Satsuma myomphala.

Ms. has 76 pages, 9 figures, 6 tables, 2 suppl. files

* Corresponding author:

Dr. Yong Seok Lee

E-Mail: [email protected]; Tel.: +82-10-4727-5524; Fax: +82-41-530-1256.

Abstract

Satsuma myomphala is critically endangered through loss of natural habitats, predation by natural enemies, and indiscriminate collection. It is a protected species in Korea but lacks the genomic resources needed to understand the varied functional processes that underlie its evolutionary success in natural habitats. To assess the genetic information of S. myomphala, we performed, for the first time, de novo transcriptome sequencing and functional annotation of expressed sequences using the Illumina Next-Generation Sequencing (NGS) platform and bioinformatics analysis. We identified 103,774 unigenes, of which 37,959, 12,890, and 17,699 were annotated in the PANM (Protostome DB), Unigene, and COG (Clusters of Orthologous Groups) databases, respectively. In addition, 14,451 unigenes were predicted under Gene Ontology functional categories, with 4581 assigned to a single category. Furthermore, 3369 sequences, 646 of which had Enzyme Commission (EC) numbers, were mapped to 122 pathways in the Kyoto Encyclopedia of Genes and Genomes Pathway database. The prominent protein domains included the Zinc finger (C2H2-like), Reverse Transcriptase, Thioredoxin-like fold, and RNA recognition motif domains. Many unigenes with homology to immunity, defense, and reproduction-related genes were screened in the transcriptome. We also detected 3120 putative simple sequence repeats (SSRs) encompassing dinucleotide to hexanucleotide repeat motifs from >1 kb unigene sequences. A list of PCR primers for SSR loci has been identified to study genetic polymorphisms. The transcriptome data represent a valuable resource for further investigations on the species' genome structure and biology. The unigene information and microsatellites would provide an indispensable tool for conservation of the species in natural and adaptive environments.

Keywords: Satsuma myomphala; Illumina sequencing; De novo annotation; Transcriptome; Functional annotation; Simple sequence repeats

1. Introduction

The Mollusca comprise the second most species-rich phylum among invertebrates, with over 100,000 described members under eight major lineages (Haszprunar et al., 2008). Molluscs are ecologically diverse and include some of the most conspicuous organisms, with a majority inhabiting the marine environment, from the intertidal zone to the deepest oceans. High diversity and abundance are also a feature of terrestrial molluscs, of which as many as 60–70 species may coexist in a single habitat. Molluscs belonging to the class Gastropoda form the largest group, with more than 62,000 described living species. Gastropod molluscs have succeeded in occupying almost all ecological habitats owing to their diversity in body, size and shell morphology. These species have been used as models for investigations into ecological, behavioral, biomechanical, physiological, and phylogenetic attributes (Kocot et al., 2011; Bourdeau et al., 2015). The land snails, in particular, show numerous special adaptations to particular habitats, and hence are one of the most successful groups on earth.

The land gastropod family Camaenidae currently comprises 87 genera, distributed mainly in the tropics of Eastern Asia and Australasia (Cuezzo, 2003; Kohler, 2010). The morphological attributes of Camaenidae snails, with reference to the shell and genital tracts, have been found useful in understanding the taxonomic position of genera inhabiting Western Australia (Kohler and Criscione, 2015). Among the taxa of Camaenidae snails occurring mainly in South-East Asia, the genus Satsuma includes three species, namely Satsuma japonica, Satsuma myomphala, and Satsuma omphaloides. A karyotype analysis of S. japonica and S. omphaloides has been reported, showing dissimilarities in chromosome sizes (Tatewaki and Kitada, 1987). S. myomphala is a terrestrial pulmonate gastropod found in Japan. In Korea, however, the species is known only from Geoje Island in Gyeongsangnam-do, where it inhabits humid and bushy sub-tropical forests. The extent of occurrence and the area of occupancy of the species are very narrow, and a living individual has been observed only once. The living individuals of the species are threatened by trampling, indiscriminate collection, predation by natural enemies, and loss of natural habitats. The species has been assessed as Critically Endangered in Korea and is classified as a monitored species under the Korea Red List (KORED) Assessment. Owing to the lack of sample resources and genomic information, progress on the phylogenomics and phylogeography of the species has been hindered. In addition to the limited genomic resources, phenotypic screens for measurable traits related to local adaptation, the immune system, and reproductive processes and pathways in S. myomphala remain poorly explored. With regard to National Centre for Biotechnology Information (NCBI) database entries, the mitochondrial cytochrome oxidase subunit I gene sequence is known for the species. Furthermore, the molecular markers relevant to specific traits, required to support breeding initiatives, have not been explored in S. myomphala. For the protection of the species in the wild and for successful marker-assisted breeding programmes, a comprehensive understanding of the genetic background seems necessary. Transcriptomics is a fast, reliable, and cost-efficient method that can explore the genetic resource information in S. myomphala by annotation of candidate genes involved in metabolism, immune and reproductive pathways. The method would also help to identify molecular markers in the gene sequences of S. myomphala, and hence could be used as a tool to take stock of selective pressures on the species' genome due to anthropogenic activities.

Next-generation sequencing (NGS) technologies have provided whole-genome and transcriptome information reliably on a regular basis (Metzker, 2010). Non-model, ecologically relevant species have been sequenced using NGS technologies, providing rich insights into adaptive features and functional processes involved in immunity and reproduction (Che et al., 2014; Yue et al., 2015). NGS technologies have been revolutionary in the study of gene expression (replacing microarray-based analysis), whole-genome sequencing, targeted re-sequencing, and DNA methylation (Morozova and Marra, 2008; Feldmeyer et al., 2011; Zeng et al., 2013; Patnaik et al., 2015; Patnaik et al., 2016a). In particular, the Illumina NGS platform has been the preferred sequencing method owing to its enhanced efficiency with shorter reads at a much lower cost (Finseth and Harrison, 2014). The de novo transcriptomes of molluscan species such as the Eastern oyster, Crassostrea virginica (Zhang et al., 2014), Octopus vulgaris (Castellanos-Martinez et al., 2014), Nerita melanotragus (Amin et al., 2014), Clithon retropictus (Park et al., 2016), and Biomphalaria glabrata (Delury et al., 2012), and of land snail species such as Theba pisana (Adamson et al., 2015), have been generated using the Illumina NGS platform.

In this study, we used the Illumina HiSeq 2500 NGS platform to generate the visceral mass transcriptome of S. myomphala. The paired-end sequences were assembled using the Trinity de novo program to generate the unigenes. The unigenes were annotated against public databases using BLASTX, and subsequently functionally classified using Gene Ontology (GO), Clusters of Orthologous Groups (COG), Kyoto Encyclopedia of Genes and Genomes (KEGG), and InterPro domain analysis. We identified unigenes related to innate immunity, defense mechanisms, and reproduction from the transcriptome analysis. In addition, we screened simple sequence repeats (SSRs) located within coding regions that could be utilized for gene polymorphism studies.

2. Material and methods

2.1. Sample preparation

Specimens of S. myomphala (four individuals) were collected in the forest near Galgot-ri, Nambu-myeon, Geoje-si, Gyeongsangnam-do, Korea on 18 August 2014. No ethics or permission certificate was required for collection as it is in the list of protected species of the National Institute of Biological Resources (NIBR), Incheon, Korea. The snails were brought to the laboratory with care and rinsed in ddH2O to remove mud attached to the shell. Housing, care, and collection of visceral mass tissue for use in sequencing experiments were conducted in adherence to the International Guiding Principles for Biomedical Research Involving Animals (1985) (http://www.ncbi.nlm.nih.gov/books/NBK25438/). Visceral mass tissue (solid mass of tissue bathing the internal organs) was dissected out of the snails. All four individual snails were taken from the same life stage to maintain uniformity in the unigene expression and annotation results. The samples were homogenized in TRIzol reagent (Invitrogen, Carlsbad, CA) following the manufacturer's protocol, and stored at -80 °C until further use. The quantity and quality of isolated RNA were confirmed using a NanoDrop spectrophotometer (Thermo Scientific, Wilmington, DE) and an Agilent 2100 Bioanalyzer (Agilent Technologies). A sample RIN (RNA integrity number) value of 7.0 was chosen as the standard for Illumina-based transcriptome sequencing. Residual DNA was removed by treatment with RNase-free DNase I (Qiagen) for 30 min at 37 °C. For cDNA library construction, approximately 5 µg of total RNA pooled from the four sampled individuals was used as input material.

2.2. Illumina cDNA library construction and sequencing

One paired-end (PE) cDNA library was constructed from the total RNA of the visceral mass tissue of S. myomphala using the mRNA-seq sample preparation kit. Briefly, oligo (dT) beads were used to isolate poly (A) mRNA. The mRNA was fragmented using the fragmentation buffer to obtain short fragments prior to cDNA synthesis. The short fragments served as templates for first-strand cDNA synthesis using random-hexamer primers. Second-strand cDNA was synthesized using RNase H (Invitrogen) and DNA polymerase I (New England BioLabs, Ipswich, MA, USA). The resulting double-stranded cDNA was end-repaired using T4 DNA polymerase, the Klenow fragment, and T4 polynucleotide kinase (New England BioLabs). Subsequently, the adapters were ligated to the end-repaired cDNA fragments using T4 DNA ligase (New England BioLabs) at room temperature for 15 min. Finally, PCR amplification and purification yielded a cDNA library with fragments of approximately 200 bp. The cDNA library was sequenced on the Illumina HiSeq 2500 NGS platform (GnC Bio Company, Daejeon, Korea), generating 100-base pair (bp) PE reads. Raw sequence reads were exported in FASTQ format and deposited at the National Centre for Biotechnology Information (NCBI) Sequence Read Archive (SRA) under the BioProject accession number SRP064883. The datasets and the assembled contig information can be downloaded on or after 16-10-2016 (exactly one year after the date of submission) from http://bioinfo.sch.ac.kr/submission/.

2.3. Analysis of sequencing reads and de novo assembly

The raw sequencing reads were filtered to generate quality reads for de novo assembly. To filter adapter-only sequences (number of nucleotides of recognized adapter ≤13, and number of nucleotides excluding the adapter ≤35), we used the command-line tool Cutadapt (http://code.google.com/p/cutadapt/) with default parameters (for paired-end reads: -a ADAPT1 -A ADAPT2 -o out1.fastq -p out2.fastq in1.fastq in2.fastq). Next, low-quality reads were filtered out based on quality scores (threshold ≥20) using the base-calling program Phred (Ewing and Green, 1998; Ewing et al., 1998). Finally, biases based on GC content were removed to obtain quality reads for an accurate de novo assembly. Random hexamer priming generally introduces GC content bias in the first 13 bases of Illumina RNA-Seq reads (Hansen et al., 2010), potentially affecting the quality of the assembly. The clean reads were assembled with the short-read assembly program Trinity Release v2.1.1 (Grabherr et al., 2011) with a K-mer of 25 and a minimum allowed length of 200 nucleotides. This K-mer size was considered appropriate because smaller K-mers result in complex de Bruijn graphs, whereas larger K-mers can result in low sequencing depth at poorly overlapped regions. The Trinity assembly generated a set of sequences containing a certain length of overlap without gaps, called contigs. The TIGR Gene Indices clustering tools (TGICL) program (Pertea et al., 2003) was used to cluster the contigs; the resulting sequences without Ns that could not be extended on either end were defined as unigenes.
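
The effect of the K-mer size on graph complexity can be illustrated with a minimal sketch. The Python below is not Trinity's implementation; it only decomposes reads into K-mers and links each (K-1)-mer prefix to its (K-1)-mer suffix, which is the basic construction of a de Bruijn graph. The function names and the two toy reads are hypothetical. The reads share only the short stretch GTT, so they become entangled at a small K but stay separate at a larger K.

```python
from collections import defaultdict

def kmers(read, k):
    """Return all overlapping k-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def branching_nodes(graph):
    """Count nodes with more than one outgoing edge (ambiguous extensions)."""
    return sum(1 for successors in graph.values() if len(successors) > 1)

# Toy reads (hypothetical); they share only the 3-base stretch "GTT".
reads = ["ACGTTGCA", "GGGTTCCC"]
for k in (3, 5):
    graph = de_bruijn_edges(reads, k)
    print("k =", k, "nodes:", len(graph), "branching nodes:", branching_nodes(graph))
```

Counting branching nodes is only one crude proxy for graph complexity; a real assembler also weighs K-mer coverage, which is what drops when K is large relative to the read overlap.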

2.4. Transcriptome annotation and functional categorization

The unigenes from the error-corrected assembly and clustering process were annotated against publicly available protein databases, namely PANM (Protostome DB), COG (Clusters of Orthologous Groups), GO (Gene Ontology), and KEGG (Kyoto Encyclopedia of Genes and Genomes), using BLASTX, and against the Unigene nucleotide database using the BLASTN algorithm (E-value cutoff of 1e-5). PANM-DB is a locally curated database of protein sequences representing the protostome groups (Mollusca, Arthropoda and Nematoda) and has been found to be more efficient than the NCBI non-redundant (nr) database in terms of speed and the number of annotation hits (Kang et al., 2015). The annotation of unigene sequences against the PANM, Unigene, and COG databases was represented in a three-way Venn diagram constructed using Venny 2.0 (http://bioinfogp.cnb.csic.es/tools/venny/). Homology assignments and top-hit species distribution of the unigene sequences were obtained against PANM-DB using the BLASTX program. Functions of the sequences were predicted by querying the COG database at NCBI (http://www.ncbi.nlm.nih.gov/COG/) using a cutoff E-value of ≤ 1e-5. For GO term classification (http://www.geneontology.org) and distribution of unigenes into three ontologies, namely biological process, cellular component, and molecular function (Ashburner et al., 2000), we employed the BLAST2GO program. InterProScan in BLAST2GO was used to annotate unigenes based on a conserved protein domain search (Quevillon et al., 2005).
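
The E-value filtering described above can be sketched as follows. This is a minimal illustration, assuming the BLAST results are available in the standard 12-column tab-separated layout (BLAST+ -outfmt 6); the manuscript does not state which output format was used. The function name, the sample rows, and the use of the bit score to pick the top hit per unigene are hypothetical choices made for the example.

```python
def best_hits(lines, evalue_cutoff=1e-5):
    """Keep, for each query, the highest-bit-score hit with E-value <= cutoff.

    Assumes 12 tab-separated columns per line:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    best = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > evalue_cutoff:
            continue  # hit does not pass the E-value cutoff
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best

# Hypothetical rows for illustration only (not results from this study).
rows = [
    "unigene1\tprotA\t55.0\t200\t90\t0\t1\t600\t1\t200\t1e-40\t160",
    "unigene1\tprotB\t48.0\t180\t94\t0\t1\t540\t5\t184\t1e-20\t95.5",
    "unigene2\tprotC\t30.0\t60\t42\t0\t1\t180\t1\t60\t0.5\t25.0",
]
hits = best_hits(rows)
print(sorted(hits))
```

In this toy input, unigene2 has no hit at or below the cutoff and would therefore be counted as unannotated.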

We used a keyword search of the unigenes annotated against PANM-DB to identify candidate S. myomphala genes involved in immunity and reproduction. To predict genes related to immunity, we used a set of representative key terms drawn from a series of innate immunity and oxidative stress genes. Similarly, representative reproduction-related genes were used as keywords to search for unigenes putatively related to reproduction. In addition, we used GO terms such as “immune system process”, “response to stimulus”, “signaling”, and “antioxidant activity” for predicting key immune genes, and the GO terms “reproduction” and “reproductive process” for selecting candidate genes related to reproduction in the land snail species.

2.5. SSR discovery and primer design

In this study, we used unigene sequences with a length of ≥1 kb to screen for simple sequence repeats. Motif lengths of two to six nucleotides were considered. The minimum repeat numbers were six, five, four, four, and four for motif lengths of two, three, four, five, and six nucleotides, respectively. Motifs of one nucleotide were not considered because of frequent homopolymer generation in Illumina sequencing. We used the MicroSAtellite (MISA; http://pgrc.ipk-gatersleben.de/misa/) program to search for SSRs in the unigenes of S. myomphala. Additionally, we identified a set of primer sequences from SSR flanking regions using the BatchPrimer3 program (You et al., 2008). The criteria used for primer identification were: dinucleotides with ten or more repeats and tri- and tetranucleotide repeats with seven or more repeats.
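
The repeat thresholds above can be expressed as a small regular-expression search. This is a minimal sketch and not the MISA program itself; the function names and test sequences are hypothetical. As an assumption of the sketch, motifs that are themselves repeats of a shorter unit (for example AA or ATAT) are skipped, so that homopolymers and shorter motifs are not reported again under a longer motif length.

```python
import re

# Minimum repeat counts per motif length, as stated in the text.
MIN_REPEATS = {2: 6, 3: 5, 4: 4, 5: 4, 6: 4}

def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit."""
    n = len(motif)
    for d in range(1, n):
        if n % d == 0 and motif[:d] * (n // d) == motif:
            return False
    return True

def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for motif_len, min_rep in MIN_REPEATS.items():
        # One captured motif followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):
                hits.append((match.start(), motif, len(match.group(0)) // motif_len))
    return sorted(hits)

# Hypothetical sequence: six AT repeats flanked by unrelated bases.
print(find_ssrs("GGC" + "AT" * 6 + "CCG"))
```

A screen like the one described would apply such a search only to unigenes of at least 1 kb; compound or interrupted repeats, which MISA can also report, are outside the scope of this sketch.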

3. Results and Discussion

3.1. Transcriptome sequencing and assembly

We characterized the visceral mass transcriptome of S. myomphala, a land snail of the family Camaenidae that is critically endangered in Korea, using the Illumina HiSeq 2500 NGS platform. Each sequencing lane generated 2 x 50 nucleotides independent reads from either end of a cDNA fragment. Approximately 264.5 million raw reads were obtained, comprising 33,320,993,328 bases. The raw reads were processed for quality assessment, which included adapter trimming and quality filtering. The data processing parameters for adapter trimming are given in Table S1. We obtained 99.83% of reads after the trimming analysis, with an average length of 125.8 bp. After quality-based filtering, approximately 261 million high-quality reads were obtained, with 32,366,103,440 clean nucleotides. The clean reads represent 98.67% of raw reads and 97.13% of total bases obtained from the Illumina sequencer. The mean length, N50 length, and GC% of the clean reads were 124 bp, 126 bp, and 48.02% (with 100% Q20 bases), respectively. The clean reads were processed with the Trinity de novo program to assemble the overlapping short reads into contiguous sequences called contigs. A total of 149,251 contigs (approximately 78.3 million nucleotides) with a mean length of 524.1 bp and an N50 length of 606 bp (i.e., 50% of the total assembled bases were contained within contigs of this length or longer) were generated by the assembly. A total of 45,726 contigs (30.64%) were longer than 500 bp, with the longest contig being 11,186 bp. The contigs were subsequently clustered into 103,774 unigenes with a mean length, N50 length, and GC% of 571.7 bp, 705 bp, and 42.86%, respectively. The shortest unigene was 133 bp and the longest was 13,172 bp. The summary statistics of the S. myomphala transcriptome assembly are shown in Table 1. The mean contig length in this study was greater than that of the contig assemblies of the snail Echinolittorina malaccana (Wang et al., 2014), the pearl oyster Pinctada maxima (Deng et al., 2014), and Anadara trapezia (Prentis and Pavasovic, 2014), which also used the Illumina sequencing platform and Trinity de novo assembly. Conversely, the mean length and N50 length in this study were smaller than those of the contig assembly of a freshwater pearl bivalve, Cristaria plicata, sequenced using the Illumina HiSeq 2500 (Patnaik et al., 2016b).
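
The N50 statistic quoted above follows directly from its definition: sort the sequences from longest to shortest and report the length at which the running total first reaches half of all assembled bases. A minimal sketch, with a hypothetical function name and toy lengths that are not taken from the assembly:

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length

# Toy contig lengths (hypothetical).
toy = [2, 3, 4, 5, 6]
print("N50:", n50(toy), "mean:", sum(toy) / len(toy))
```

Because N50 is weighted by bases rather than by sequence count, it is usually larger than the mean length when an assembly contains many short sequences, which is the pattern reported for both the contigs and the unigenes here.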

The size distributions of the assembled contigs and unigenes are shown in Fig. 1A and Fig. 1B, respectively. Among the contigs, the larger fraction (69.46%) of the assembled sequences had sizes ≤500 bp. A total of 30,422 (20.38%), 12,195 (8.17%), and 2971 (1.99%) contigs had sizes in the ranges of 501–1000 bp, 1001–2000 bp, and ≥2001 bp, respectively. Among the unigenes, 67,445 (64.99%) were no longer than 500 bp, 23,193 (22.35%) were in the range of 501 to 1000 bp, 10,328 (9.95%) ranged from 1001 to 2000 bp, and 2808 (2.71%) were 2001 bp or longer.

3.2. Assembly evaluation and annotation

The assembly-derived unigenes were searched against public protein and nucleotide databases for validation and annotation using the BLASTX and BLASTN programs (E-value of ≤ 1e-5). The summary statistics for BLAST assignments against the PANM, Unigene, COG, GO, and KEGG databases are shown in Table 2. Out of 103,774 unigenes, 37,959 (36.58%), 17,699 (17.06%), and 12,890 (12.42%) showed annotation hits to homologous sequences in the PANM, COG, and Unigene databases, respectively. In addition, 14,451 and 1453 unigenes showed homology matches in the GO and KEGG databases, respectively. In all, 41,245 (39.75%) assembled unigene sequences showed annotation hits to at least one of the protein and nucleotide databases used in this study. The unigenes ranging in length from 300–1000 bp showed the greatest match percentages, followed by unigenes with lengths of ≥1000 bp. The unannotated sequences may represent short sequences lacking a conserved protein domain. Furthermore, the transcriptome may also contain incompletely spliced introns, orphaned untranslated regions (UTRs), non-coding genes, and random transcriptional noise that may not show homology to available sequences in the public databases. In addition, the absence of annotation for some genes may reflect genes expressed at low levels, or not expressed at all, at the time of RNA extraction. Moreover, the annotation against the PANM database showed the maximum hits, as it contains protein sequences from only the protostome groups (Mollusca, Arthropoda and Nematoda). The database is the latest version of the Mollusks database (Kang et al., 2014) and has been found to be more efficient than the NCBI non-redundant (nr) database in terms of time and efficiency of annotation hits (Kang et al., 2015). A three-way Venn diagram (Fig. 2) shows the assembled unigenes annotated against the PANM, Unigene, and COG databases. A total of 7179 unigenes were concurrently annotated by all three databases, while 1 unigene had homologous sequences in both the Unigene and COG databases. In addition, 10,516 unigenes were annotated in both PANM and COG, and 2711 unigenes in both PANM and Unigene.
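
The region counts in a three-way Venn diagram such as Fig. 2 can be derived from the lists of annotated unigene identifiers with set operations. The sketch below is illustrative only: the function name and the toy identifiers are hypothetical, and it is not the Venny 2.0 implementation. Each region is exclusive, so the seven counts sum to the number of unigenes annotated in at least one of the three databases.

```python
def venn3_regions(a, b, c):
    """Return the sizes of the seven exclusive regions of a three-set Venn diagram."""
    a, b, c = set(a), set(b), set(c)
    return {
        "a_only": len(a - b - c),
        "b_only": len(b - a - c),
        "c_only": len(c - a - b),
        "a_b_only": len((a & b) - c),
        "a_c_only": len((a & c) - b),
        "b_c_only": len((b & c) - a),
        "all_three": len(a & b & c),
    }

# Toy identifier lists (hypothetical), standing in for PANM, Unigene, and COG hits.
panm = ["u1", "u2", "u3", "u4"]
unigene = ["u3", "u4", "u5"]
cog = ["u2", "u4", "u6"]

regions = venn3_regions(panm, unigene, cog)
print(regions)
print("annotated in at least one database:", sum(regions.values()))
```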

Considering the maximum annotation hits to the PANM database, we studied the homology characteristics of the BLASTX hits for the annotated unigenes in this database. The E-value distribution indicated that 69.74% of annotated unigenes had an E-value in the range of 1E-50 to 1E-5, followed by 16.18% of unigenes with an E-value in the range of 1E-100 to 1E-50 (Fig. 3A). The identity distribution of annotated unigenes against the PANM database showed that the largest share of unigenes (38.6%) had an identity of 40–60%, followed by 25.29% with an identity of 60–80% and 23.49% with an identity of 13–40% (Fig. 3B). The similarity distribution shows that 41.96% of the unigenes had a top-hit similarity of 60–80%, while 29.06% had a similarity of 80–100% (Fig. 3C). Our results also confirm that shorter unigenes are less represented in sequence annotation hits, and that the proportion of annotation hits increases with unigene length. We found close to 80% of sequence hits for sequences of 2001 bp or longer (Fig. 3D). Generally, the conserved protein domains that indicate a putative function and produce an annotation hit are found in longer sequences, while shorter sequences lack them (Che et al., 2014). Among the annotated unigenes in the PANM database, the top-hit species was Aplysia californica, with 51.72% of unigenes (Fig. 4). A. californica is an aquatic gastropod with the most extensive genomic information. In fact, the species was the first mollusc to be sequenced at the genome level, and the information has led to rich insights into invertebrate evolution, developmental biology and neurobiology of molluscs (Choi et al., 2010; Fiedler et al., 2010; Heyland et al., 2011). Other top-hit species with sufficient genomic resources included the marine limpet, Lottia gigantea (Simakov et al., 2013) and the oyster, Crassostrea gigas (Zhang et al., 2012). Another point regarding the top-hit species distribution from BLASTX annotation is that the procedure is biased towards the completeness of genome annotation for each genome within the searched databases (Shaw et al., 2012). Despite these inherent problems, annotation against protein databases has been used in extensive reports, which suggests its utility as the best annotation method for non-model species (Chen et al., 2015; Richardson and Sherman, 2015; Patnaik et al., 2015).

3.3. COG, GO, KEGG, and InterPro domain classifications

The assembled unigene sequences from the S. myomphala transcriptome were classified based on predicted functions in the COG database. The 17,699 unigenes annotated in the COG database were classified for putative functions under 25 categories, excluding the ‘multi: more than one functional class’ category (Fig. 5). Of the 25 COG categories, the ‘general function prediction only’ category formed the largest group, followed by the ‘signal transduction mechanisms’, ‘function unknown’, and ‘post-translational modification, protein turnover and chaperones’ categories. The other COG categories to which the unigenes were classified accounted for less than 5% of assembled sequences. The smallest COG groups to which the unigenes were mapped were ‘cell motility’ and ‘nuclear structure’. Generally, these COG categories are grouped under four COG classes: Metabolism, Information storage and processing, Cellular processes and signaling, and Poorly characterized genes. In our study, the unigenes were predominantly assigned to the cellular processes and signaling class (26.81%), followed by the poorly characterized (24.1%) and metabolism (16.3%) classes. This is in contrast to the COG classification in another land snail species, Theba pisana, wherein the predominant COG class was information storage and processing, followed by cellular processes and signaling, and metabolism (Adamson et al., 2015).

A GO classification, which is an international standard, suggests that a gene is grouped to those of known (or predicted) function, but is not, strictly speaking, evidence of functionality.

The GO database annotated unigenes (14,451) were classified in terms of their associated biological process, cellular component and molecular function (Fig. 6A). Among the annotated unigenes, 4923, 637, and 623 sequences were classified with molecular function, cellular component, and biologicalACCEPTED process, respectively. MANUSCRIPT In addition, 4683, 636, and 396 unigenes were suggested to associate with both biological process and molecular function, biological process and cellular component, and cellular component and molecular function, respectively. About

2554 annotated unigenes were associated with all three GO categories. This suggests a strong overlap of associated functions for unigenes under the biological process and molecular function categories. Furthermore, only 4581 unigenes (31.7%) were classified under a single GO subcategory; all other unigenes were grouped into more than one subcategory at level 2 of the GO term classification. The relationship between the number of unigenes and the number of GO terms assigned to them is shown in Fig. 6B. The mapping of GO terms to the annotated unigenes under the biological process, cellular component, and molecular function categories is depicted in Fig. 7. The top GO term assignments for biological process were metabolic process (4407), cellular process (4152), and single-organism process (3390) (Fig. 7A). Other prominent GO terms in this category included response to stimulus (907) and signaling (660). For cellular component, the top represented GO terms were cell (1948), membrane (1838), and organelle (1345) (Fig. 7B). In the molecular function category, the top GO terms associated with the annotated unigenes were binding (6885) and catalytic activity (4318) (Fig. 7C). Although not represented among the top 10 GO terms, unigenes were also mapped to processes such as reproduction (20), immune system process (15), and reproductive process (13) under the biological process category. We believe the GO classification in the present study should be interpreted in light of the evidence codes associated with each GO term. This is because a large fraction of such annotations in biological databases are based on computational methods without experimental validation, and hence could increase the number of false positives. The evidence codes in the present study also showed that a majority of terms are assigned the code IEA (inferred from electronic annotation), which is derived from computational methods. Certain low-quality terms with evidence codes such as NAS (non-traceable author statement) or ND (no biological data available) have to be controlled for to put the annotation results in perspective (Rhee et al., 2008; Rogers and Ben-Hur, 2009). We emphasize that any interpretation of unigene function is a prediction and that not all GO terms are of equal validity.
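The evidence-code caveat above can be made concrete with a minimal Python sketch. It is not the procedure used in this study: the record layout (unigene identifier, GO term, evidence code), the example records, and the choice to treat IEA, NAS, and ND as the codes to exclude are all assumptions made for illustration.

```python
from collections import Counter

# Evidence codes treated as lower confidence in this sketch (an assumption):
# IEA = inferred from electronic annotation, NAS = non-traceable author
# statement, ND = no biological data available.
LOW_CONFIDENCE_CODES = {"IEA", "NAS", "ND"}


def summarize_evidence(annotations):
    """Count annotations per evidence code.

    `annotations` is an iterable of (unigene_id, go_term, evidence_code).
    """
    return Counter(code for _, _, code in annotations)


def filter_by_evidence(annotations, excluded=LOW_CONFIDENCE_CODES):
    """Keep only annotations whose evidence code is not in `excluded`."""
    return [record for record in annotations if record[2] not in excluded]


# Hypothetical records for illustration only (not data from this study).
example = [
    ("unigene_0001", "GO:0008152", "IEA"),
    ("unigene_0001", "GO:0005488", "IDA"),
    ("unigene_0002", "GO:0009987", "NAS"),
    ("unigene_0003", "GO:0003824", "IEA"),
]

counts = summarize_evidence(example)
kept = filter_by_evidence(example)
```

Tabulating the codes first shows how much of an annotation set rests on electronic inference before any filtering decision is made.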

KEGG Orthology (KO) mapping provides an alternative functional classification of the assembled sequences based on canonical pathways in KEGG (Kanehisa et al., 2004). We found that 3369 unigene sequences were assigned to 122 KEGG pathways (Fig. 8). In addition, 646 sequences shared homology with known enzymes in the KEGG database. The most predominant category was metabolism (3233), followed by immune system pathways (62). Within the metabolism category, nucleotide metabolism (878) had the most associated unigenes, followed by metabolism of cofactors and vitamins (554) and carbohydrate metabolism (417). The unigenes predicted under immune system process were ascribed to the T-cell receptor signaling pathway, and those predicted under environmental information processing were ascribed to the phosphatidylinositol signaling system and the mammalian target of rapamycin (mTOR) signaling pathway.

To further characterize the predicted functions of unigenes, we searched the assembled unigenes against the InterProScan database using the BLAST2GO software. The InterProScan conserved protein domain search is considered indispensable in BLAST annotation mapping to identify functional units. A total of 15,742 unigene sequences (15.17%) were annotated in InterProScan, yielding 5928 domains. The top 25 protein domains identified by InterProScan for the S. myomphala unigene assembly are summarized in Table 3. The results showed that “Zinc finger, C2H2-like”, with top hits from 518 unigenes, was the most represented conserved domain, followed by “P-loop containing nucleoside triphosphate hydrolase domain” and “Reverse transcriptase domain”. Other notable protein domains in the unigene annotation were the “RNA recognition motif”, “Ankyrin repeat”, “WD40 repeat”, “C-type lectin”, “Immunoglobulin-like fold”, “Protein kinase”, “Major facilitator superfamily”, and “Cadherin” domains (Table 3). The most abundant protein domains in the present study relate to nucleotide binding, transmembrane, and signaling regions, which are critical for mediating regulatory gene processes. Transcription factors, including the “Zinc finger, C2H2-like” proteins that mediate interactions with nucleotides and other proteins, made up a large proportion of the InterProScan-annotated sequences. The prevalence of G-protein coupled receptors (GPCRs) of the rhodopsin-like family is expected given their prevalence in sensory systems. GPCRs and transcription factors (zinc fingers) were also abundantly annotated in the eye transcriptomes of the bay scallop Argopecten irradians and the sea scallop Placopecten magellanicus (Pairett and Serb, 2013).

3.4. Candidate immunity genes in S. myomphala transcriptome

Considering the critically endangered status of S. myomphala, we found it necessary to identify genes that may be involved in innate immunity and defense of the species against biotic and abiotic stressors. Global immunity-related genes have been characterized in some molluscan species, including the Eastern oyster Crassostrea virginica (Tanguy et al., 2004), the Pacific oyster Crassostrea gigas (Renault et al., 2011; Fleury and Huvet, 2012), the Manila clam Ruditapes philippinarum (Moreira et al., 2014), and the gastropod snail Biomphalaria glabrata (Guillou et al., 2007; Bouchut et al., 2007; Lockyer et al., 2008), using suppression subtractive hybridization (SSH) and cDNA microarrays. Currently, the transcriptome-wide approach to identifying candidate immunity and defense genes is preferred owing to the substantial decline in sequencing costs and improved algorithms for assembly and functional annotation. We identified partial and full-length transcripts of S. myomphala unigenes associated with these functions using a keyword search of the BLASTX results against the PANM database. Both GO term and KEGG analyses also identified transcripts related to immunity and defense in the present study, as in previous studies of many non-model species (Lockwood and Somero, 2011; Richardson and Sherman, 2015). The putative protein families identified from the annotation analysis in the present study include Toll-like receptors (TLRs), lectins, scavenger receptors, peptidoglycan recognition proteins, galectins, C1q domain proteins, and oxidative stress, apoptosis-related, and heat shock proteins (HSPs) (Table 4). Among the pathogen recognition receptors (PRRs), TLRs form an important class that recognizes molecular patterns in pathogens and modulates signal transduction for efficient transcription of effector antimicrobial peptides (Halanych and Kocot, 2014). We found unigenes putatively related to TLR-2, -3, -4, -6, -7, and -13, and to the Toll-interacting protein.

TLRs are known from molluscs and bear a conserved domain organization (N-terminal leucine-rich repeat motifs, a transmembrane region, and a cytoplasmic Toll/interleukin-1 receptor domain) (Zhang et al., 2013; Zhang et al., 2014). We found other key genes of the TLR pathway, including Myeloid differentiation factor 88 (MyD88), Tumor necrosis factor receptor-associated factor (TRAF), Cactus/IκB, Relish, p38, and Activator protein-1 (AP-1). The TLR in molluscs associates with MyD88 and passes immune signals through the kinases to transcribe AP-1 (p38 signaling pathway) (Zhang et al., 2014). TLR signaling components have been reported from Chlamys farreri (Qiu et al., 2007), Pinctada fucata (Wu et al., 2007), and the abalone Haliotis diversicolor supertexta (Jiang and Wu, 2007). Among the other immune repertoire, the carbohydrate-recognition domain (CRD) lectins show variable specificities in recognizing pathogens. CRD-galectins (especially the multimeric CRD galectins) have been identified from C. virginica (Tasumi and Vasta, 2007) and P. fucata (Wang et al., 2011). The exact role of scavenger receptors in molluscan immunity is still not known, although their involvement in pathogen recognition and establishment of infection in other invertebrates cannot be discounted (Mukhopadhyay and Gordon, 2004). Short peptidoglycan recognition proteins (PGRPs), very similar to those in the present study, have been identified from the scallops A. irradians (Ni et al., 2007) and C. farreri (Su et al., 2007). The C1q domain proteins show remarkable abundance in molluscs, with 168 and 187 transcript sequences identified in Mytilus galloprovincialis (Gerdol et al., 2011) and C. virginica (Zhang et al., 2014), respectively.
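The keyword search of BLASTX results mentioned above can be sketched in Python as follows. This is an illustration under stated assumptions, not the script used in this study: the keyword list is a hypothetical selection of the protein families named in the text, and the input is assumed to be pairs of unigene identifier and top-hit description.

```python
import re

# Hypothetical keyword list drawn from the protein families named in the text.
IMMUNE_KEYWORDS = [
    "toll-like receptor",
    "lectin",
    "scavenger receptor",
    "peptidoglycan recognition",
    "galectin",
    "C1q",
    "heat shock protein",
    "lysozyme",
]


def find_candidates(hits, keywords=IMMUNE_KEYWORDS):
    """Group unigene ids by the keywords found in their hit descriptions.

    `hits` is an iterable of (unigene_id, hit_description) pairs.
    Matching is a case-insensitive substring search.
    """
    patterns = {kw: re.compile(re.escape(kw), re.IGNORECASE) for kw in keywords}
    matched = {kw: set() for kw in keywords}
    for unigene_id, description in hits:
        for kw, pattern in patterns.items():
            if pattern.search(description):
                matched[kw].add(unigene_id)
    # Drop keywords with no hits and return sorted id lists.
    return {kw: sorted(ids) for kw, ids in matched.items() if ids}


# Hypothetical hit descriptions for illustration only.
example_hits = [
    ("u1", "Toll-like receptor 4 [Crassostrea gigas]"),
    ("u2", "galectin-4"),
    ("u3", "hypothetical protein"),
]
candidates = find_candidates(example_hits)
```

Because matching is by substring, a galectin hit is also reported under "lectin"; candidate lists built this way would still need manual curation.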

Among the immune effectors, lysozyme (G-type) was identified in the present study, but defensins were absent. A G-type lysozyme with lytic activity specific to gram-positive bacteria has also been reported from the scallop C. farreri (Zhao et al., 2007). The oxidative stress defense genes identified from the unigene profile of S. myomphala include stress response enzymes such as catalase, superoxide dismutase (SOD), glutathione peroxidase (GPx), glutathione-S-transferase (GST), and glutathione synthetase. A similar antioxidant gene profile was noted in the invasive mussel Limnoperna fortunei (Uliano-Silva et al., 2014), which may be responsible for adaptation to different climate zones and habitats or for the establishment of an indigenous species in a defined environmental condition. We also found several apoptosis pathway genes in the S. myomphala transcriptome, including Tumor necrosis factor (TNF), caspases, Bcl-2, and Bax, suggesting an efficient apoptosis regulatory pathway.

3.5. Candidate reproduction genes in S. myomphala transcriptome

The repertoire of reproduction genes was identified in the S. myomphala transcriptome so that the information can be used in controlled reproduction strategies for conservation of the species. We identified many unigenes related to spermatogenesis-associated proteins, sperm-associated antigens, enzymes such as spermine synthase and oxidase, vitellogenin-1, and vitellogenin receptor, among others (Table 5). The majority of male reproduction-related genes are associated with sperm motility, which may be crucial for the reproductive success of the species. In another study on the endangered Chinese salamander Hynobius chinensis, key genes related to male and female reproductive tissues were annotated (Che et al., 2014). Such proteins were also found to be differentially upregulated in the testis tissue of the sea urchin Arbacia lixula (Perez-Portela et al., 2015). Vitellogenin has been isolated from molluscs such as C. gigas (Suzuki et al., 1992), Mytilus edulis (Puinean and Rotchell, 2006), Haliotis discus hannai (Matsumoto et al., 2008), Chlamys farreri (Qin et al., 2012), and Chlamys nobilis (Zheng et al., 2012), and very recently from the transcriptome analysis of Cristaria plicata (Patnaik et al., 2016b). Vitellogenin is an important protein in oocyte maturation in molluscan species and hence shows a strong presence in the ovary. The reproduction-related genes in the present study represent novel molecular information regarding S. myomphala reproduction.

3.6. Characterization of SSRs and PCR primers

Conservation of threatened and endangered species through the application of microsatellite markers has seen reasonable growth with the routine application of NGS technologies to sequence non-model species (Abdul-Muneer, 2014). Microsatellites, also called SSRs or STRs (short tandem repeats), are increasingly being used to study population structure dynamics in rare, endemic, and threatened biodiversity (Abdul-Muneer et al., 2009; Yue et al., 2015). SSRs and SNPs (single nucleotide polymorphisms) are the basis of comparative genomics analysis, genetic diversity assessment, and genetic linkage mapping, which in turn are used for marker-assisted selection programs (Liu and Cordes, 2004; Arias-Perez et al., 2012). Considering the advantages of transcriptome-derived microsatellites (greater transferability across related species, linkage to genes of interest, lower cost of development, and higher quality), we screened the assembled unigenes of S. myomphala for SSRs. To obtain high-precision SSRs, we examined only unigenes longer than 1 kb (13,157 sequences and 21,708,495 nucleotides) using the MISA microsatellite identification tool. We identified 3120 SSRs (in 2253 SSR-containing sequences) comprising di- to hexanucleotide repeat motifs, with a minimum of five occurrences for tri- to hexanucleotide repeats and six occurrences for dinucleotide repeat motifs. We identified 547 unigene sequences containing more than one SSR and 543 SSRs in compound formation. Among the SSR repeat motifs, dinucleotide repeats were the most common (1628, 52.18%), followed by trinucleotide (1067, 34.2%) and tetranucleotide (11.35%) repeat motifs. The di-, tri-, and tetranucleotide repeat motifs had maximum occurrences at six (484), five (560), and four (206) iterations, respectively. The penta- and hexanucleotide repeat motifs were found at a maximum of four iterations. The repeat motifs and their numbers of iterations are listed in Table 6. The SSRs identified in the Cristaria plicata (Patnaik et al., 2016b) and Crassostrea hongkongensis (Tong et al., 2015) transcriptomes showed a similar pattern, with dinucleotide repeat motifs dominant. Conversely, the Pinctada margaritifera gonad transcriptome (Teaniniuraitemoana et al., 2014) showed tetranucleotides as the most dominant motif, whereas in the Pinctada maxima mantle transcriptome (Deng et al., 2014), tetranucleotides were the least represented.
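The screening criteria described above (unigenes longer than 1 kb; at least six repeats for dinucleotide motifs and five for tri- to hexanucleotide motifs) can be illustrated with a short Python sketch. This is not MISA and does not reproduce its algorithm or its handling of compound SSRs; it is a simplified regular-expression search for perfect repeats. The function names are hypothetical, and reading ">1 kb" as more than 1000 nucleotides is an assumption.

```python
import re

# Minimum number of repeat units per motif length, following the thresholds
# stated in the text: six for dinucleotides, five for tri- to hexanucleotides.
MIN_REPEATS = {2: 6, 3: 5, 4: 5, 5: 5, 6: 5}

# Assumption: "unigenes of size >1 kb" is read as more than 1000 nucleotides.
MIN_UNIGENE_LENGTH = 1000


def is_reducible(motif):
    """Return True if the motif is a repeat of a shorter unit (e.g. ACAC, AAA)."""
    n = len(motif)
    return any(
        n % k == 0 and motif == motif[:k] * (n // k)
        for k in range(1, n)
    )


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return (motif, repeat_count, start) tuples for perfect SSRs in seq."""
    seq = seq.upper()
    found = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        # One captured unit followed by at least (min_rep - 1) copies of it.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_reducible(motif):
                # Skip homopolymers and motifs already covered by a shorter unit.
                continue
            found.append((motif, len(match.group(0)) // unit_len, match.start()))
    return found


def screen_unigenes(unigenes, min_length=MIN_UNIGENE_LENGTH):
    """Scan only unigenes longer than min_length; unigenes maps id -> sequence."""
    return {
        uid: find_ssrs(seq)
        for uid, seq in unigenes.items()
        if len(seq) > min_length
    }
```

For example, `find_ssrs("GGACACACACACACTT")` reports one AC repeat with six units starting at position 2, while five AC units fall below the dinucleotide threshold and are not reported.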

We also analyzed the most dominant repeat motif types among the di- to hexanucleotide repeats, as shown in Fig. 9. The most abundant repeat motif types were AC/GT (851, 27.28%), AG/CT (567, 18.17%), and AT/AT (197, 6.31%) among dinucleotide repeats, ATC/GTT (403, 12.92%) and AAG/CTT (223, 7.15%) among trinucleotide repeats, and ACAG/CTGT (64, 2.05%) among tetranucleotide repeats. The dinucleotide motif types in the C. plicata transcriptome were essentially similar, but among the trinucleotide motif types, AAT/ATT followed by ATC/AGT were the most common (Patnaik et al., 2016b). Because mononucleotide repeats were excluded from the SSR analysis in the present study, owing to possible homopolymer formation in Illumina sequencing, the SSR dataset should be suitable for identifying polymorphic microsatellite loci. The large array of SSRs identified in the S. myomphala transcriptome may be useful for genetic mapping of the species, evolutionary and taxonomic classification, and conservation of the species in its natural habitat.
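The percentages quoted for the motif types are each count's share of the 3120 SSRs identified. The following Python sketch recomputes them from the counts given in the text; the variable and function names are hypothetical.

```python
# Total SSRs and per-motif counts as reported in the text.
TOTAL_SSRS = 3120
motif_counts = {
    "AC/GT": 851,
    "AG/CT": 567,
    "AT/AT": 197,
    "ATC/GTT": 403,
    "AAG/CTT": 223,
    "ACAG/CTGT": 64,
}


def percent_share(count, total=TOTAL_SSRS):
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100.0 * count / total, 2)


shares = {motif: percent_share(count) for motif, count in motif_counts.items()}
```

The recomputed shares agree with the reported values, for example 851 / 3120 = 27.28% for AC/GT.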

4. Conclusion

In this study, we used the Illumina NGS platform to sequence, for the first time, the transcriptome of the critically endangered land snail S. myomphala. The unigenes identified after de novo assembly of high-quality sequencing reads were annotated against public protein and nucleotide databases for predicted functional classification. The unigenes were annotated with GO terms, KEGG pathways, COG classifications, and InterProScan conserved domains. We identified unigene sequences with putative functions in immunity, defense, and reproduction, including TLRs, intracellular signaling molecules, effectors, oxidative stress enzymes, apoptotic pathway genes, and spermatogenesis-associated proteins. We identified a large number of SSR markers that would be valuable for gene polymorphism and variability studies.

Acknowledgments

This work was supported by the grant entitled “The Genetic and Genomic Evaluation of Indigenous Biological Resources” funded by the National Institute of Biological Resources (NIBR201503202).

References

Abdul-Muneer, P.M., 2014. Application of Microsatellite markers in conservation genetics and fisheries management: Recent advances in population structure analysis and conservation strategies. Genet. Res. Int. Article ID 691759.

Abdul-Muneer, P.M., Gopalakrishnan, A., Musammilu, K.K., Mohindra, V., Lal, K.K., Basheer, V.S., Lakra, W.S., 2009. Genetic variation and population structure of endemic yellow catfish, Horabagrus brachysoma (Bagridae) among three populations of Western Ghat region using RAPD and microsatellite markers. Mol. Biol. Rep. 36, 1779-1791.

Adamson, K.J., Wang, T., Zhao, M., Bell, F., Kuballa, A.V., Storey, K.B., Cummins, S.F., 2015. Molecular insights into land snail neuropeptides through transcriptome and comparative gene analysis. BMC Genomics 16, 308.

Amin, S., Prentis, P.J., Gilding, E.K., Pavasovic, A., 2014. Assembly and annotation of a non-model gastropod (Nerita melanotragus) transcriptome: a comparison of De novo assemblers. BMC Res. Notes 7, 488.

Arias-Perez, A., Fernandez-Tajes, J., Gaspar, M.B., Mendez, J., 2012. Isolation of microsatellite markers and analysis of genetic diversity among East Atlantic populations of the sword Razor shell Ensis siliqua: A tool for population management. Biochem. Genet. 50, 397-415.

Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., Harris, M.A., Hill, D.P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J.C., Richardson, J.E., Ringwald, M., Rubin, G.M., Sherlock, G., 2000. Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25-29.

Bouchut, A., Coustau, C., Gourbal, B., Mitta, G., 2007. Compatibility in the Biomphalaria glabrata/Echinostoma caproni model: new candidate genes evidenced by a suppressive subtractive hybridization approach. Parasitology 134, 575-588.

Bourdeau, P.E., Butlin, R.K., Bronmark, C., Edgell, T.C., Hoverman, J.T., Hollander, J., 2015. What can aquatic gastropods tell us about phenotypic plasticity? A review and meta-analysis. Heredity 115, 312-321.

Castellanos-Martinez, S., Artela, D., Catarino, S., Gestal, C., 2014. De novo transcriptome sequencing of the Octopus vulgaris hemocytes using Illumina RNA-Seq technology: response to the infection by the gastrointestinal parasite Aggregata octopiana. PLoS ONE 9, e107873.

Che, R., Sun, Y., Wang, R., Xu, T., 2014. Transcriptomic analysis of endangered Chinese Salamander: Identification of Immune, Sex and Reproduction-related genes and Genetic Markers. PLoS ONE 9, e87940.

Chen, H., Lin, L., Xie, M., Zhang, G., Su, W., 2015. De novo sequencing and characterization of the Bradysia odoriphaga (Diptera: Sciaridae) larval transcriptome. Comp. Biochem. Physiol. Part D 16, 20-27.

Choi, S.L., Lee, Y.S., Rim, Y.S., Kim, T.H., Moroz, L.L., Kandel, E.R., Bhak, J., Kaang, B.K., 2010. Differential evolutionary rates of neuronal transcriptome in Aplysia kurodai and Aplysia californica as a tool for gene mining. J. Neurogenet. 24, 75-82.

Cuezzo, M.G., 2003. Phylogenetic analysis of the Camaenidae (Mollusca: ) with special emphasis on the American taxa. Zool. J. Linn. Soc. 138, 449-476.

Deleury, E., Dubreuil, G., Elangovan, N., Wajnberg, E., Reichhart, J.M., Gourbal, B., Duval, D., Baron, O.L., Gouzy, J., Coustau, C., 2012. Specific versus non-specific immune responses in an invertebrate species evidenced by a comparative de novo sequencing study. PLoS ONE 7, e32512.

Deng, Y., Lei, Q., Tian, Q., Xie, S., Du, X., Li, J., Wang, L., Xiong, Y., 2014. De novo assembly, gene annotation, and simple sequence repeat marker development using Illumina paired-end transcriptome sequences in the pearl oyster Pinctada maxima. Biosci. Biotechnol. Biochem. 78, 1685-1692.

Ewing, B., Green, P., 1998. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8, 186-194.

Ewing, B., Hillier, L., Wendl, M.C., Green, P., 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8, 175-185.

Feldmeyer, B., Wheat, C.W., Krezdorn, N., Rotter, B., Pfenninger, M., 2011. Short read Illumina data for the de novo assembly of a non-model snail species transcriptome (Radix balthica, Basommatophora, ), and a comparison of assembler performance. BMC Genomics 12, 317.

Fiedler, T.J., Hudder, A., McKay, S.J., Shivkumar, S., Capo, T.R., Schmale, M.C., Walsh, P.J., 2010. The transcriptome of the early life history stages of the California Sea Hare Aplysia californica. Comp. Biochem. Physiol. Part D 5, 165-170.

Finseth, F.R., Harrison, R.G., 2014. A comparison of next generation sequencing technologies for transcriptome assembly and utility for RNA-Seq in a non-model bird. PLoS ONE 9, e108550.

Fleury, E., Huvet, A., 2012. Microarray analysis highlights immune response of Pacific oysters as a determinant of resistance to summer mortality. Mar. Biotechnol. 14, 203-217.

Gerdol, M., Manfrin, C., De Moro, G., Figueras, A., Novoa, B., Venier, P., Pallavicini, A., 2011. The C1q domain containing proteins of the Mediterranean mussel Mytilus galloprovincialis: A widespread and diverse family of immune-related molecules. Dev. Comp. Immunol. 35, 635-643.

Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q., Chen, Z., Mauceli, E., Hacohen, E., Gnirke, A., Rhind, N., di Palma, F., Birren, B.W., Nusbaum, C., Lindblad-Toh, K., Friedman, N., Regev, A., 2011. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nat. Biotechnol. 29, 644-652.

Guillou, F., Mitta, G., Galinier, R., Coustau, C., 2007. Identification and expression of gene transcripts generated during an anti-parasitic response in Biomphalaria glabrata. Dev. Comp. Immunol. 31, 657-671.

Halanych, K.M., Kocot, K.M., 2014. Repurposed transcriptomic data facilitate discovery of innate immunity toll-like receptor (TLR) genes across Lophotrochozoa. Biol. Bull. 227, 201-209.

Hansen, K.D., Brenner, S.E., Dudoit, S., 2010. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res. 38, e131.

Haszprunar, G., Schander, C., Halanych, K.M., 2008. Relationships of higher molluscan taxa. In: Ponder, W.F., Lindberg, D.R. (Eds.), Phylogeny and Evolution of the Mollusca. University of California Press, Berkeley and Los Angeles, California, pp. 19-32.

Heyland, A., Vue, Z., Voolstra, C.R., Medina, M., Moroz, L.L., 2011. Developmental transcriptome of Aplysia californica. J. Expt. Zool. B Mol. Dev. Biol. 0, 113-134.

Jiang, Y., Wu, X., 2007. Characterization of a Rel/NF-κB homologue in a gastropod abalone, Haliotis diversicolor supertexta. Dev. Comp. Immunol. 31, 121-131.

Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y., Hattori, M., 2004. The KEGG resource for deciphering the genome. Nucleic Acids Res. 32(Database issue), D277-280.

Kang, S.W., Hwang, H.J., Park, S.Y., Wang, T.H., Park, E.B., Lee, T.H., Hwang, U.W., Lee, J.S., Park, H.S., Han, Y.S., Lim, C.E., Kim, S., Lee, Y.S., 2014. Mollusks sequence database: Version II. Kor. J. Malacol. 30, 429-431.

Kang, S.W., Park, S.Y., Patnaik, B.B., Hwang, H.J., Kim, C., Kim, S., Lee, J.S., Han, Y.S., Lee, Y.S., 2015. Construction of PANM database (Protostome DB) for rapid annotation of NGS data in mollusks. Korean J. Malacol. 31, 243-247.

Kocot, K.M., Cannon, J.T., Todt, C., Citarella, M.R., Kohn, A.B., Meyer, A., Santos, S.R., Schander, C., Moroz, L.L., Lieb, B., Halanych, K.M., 2011. Phylogenomics reveals deep molluscan relationships. Nat. 477, 452-456.

Kohler, F., 2010. Three new species and two new genera of land snails from the Bonaparte Archipelago in the Kimberley, Western Australia (Pulmonata, Camaenidae). Mollusc. Res. 30, 1-16.

Kohler, F., Criscione, F., 2015. A molecular phylogeny of camaenid land snails from north-western Australia unravels widespread homoplasy in morphological characters (Gastropoda, ). Mol. Phylogenet. Evol. 83, 44-55.

Liu, Z.J., Cordes, J.F., 2004. DNA marker technologies and their applications in aquaculture genetics. Aquaculture 238, 1-37.

Lockwood, B.L., Somero, G.N., 2011. Transcriptomic responses to salinity stress in invasive and native blue mussels (genus Mytilus). Mol. Ecol. 20, 517-529.

Lockyer, A.E., Spinks, J., Kane, R.A., Hoffmann, K.F., Fitzpatrick, J.M., Rollinson, D., Noble, L.R., Jones, C.S., 2008. Biomphalaria glabrata transcriptome: cDNA microarray profiling identifies resistant- and susceptible-specific gene expression in hemocytes from snail strains exposed to Schistosoma mansoni. BMC Genomics 9, 634.

Matsumoto, T., Yamano, K., Kitamura, M., Hara, A., 2008. Ovarian follicle cells are the site of vitellogenin synthesis in the Pacific abalone Haliotis discus hannai. Comp. Biochem. Physiol. Part A 149, 293-298.

Metzker, M.L., 2010. Sequencing technologies – the next generation. Nat. Rev. Genet. 11, 31-46.

Moreira, R., Milan, M., Balseiro, P., Romero, A., Babbucci, M., Figueras, A., Bargelloni, L., Novoa, B., 2014. Gene expression profile analysis of Manila clam (Ruditapes philippinarum) hemocytes after a Vibrio alginolyticus challenge using an immune-enriched oligo-microarray. BMC Genomics 15, 267.

Morozova, O., Marra, M.A., 2008. Applications of next-generation sequencing technologies in functional genomics. Genomics 92, 255-264.

Mukhopadhyay, S., Gordon, S., 2004. The role of scavenger receptors in pathogen recognition and innate immunity. Immunobiol. 209, 39-49.

Ni, D., Song, L., Wu, L., Chang, Y., Yu, Y., Qiu, L., Wang, L., 2007. Molecular cloning and mRNA expression of peptidoglycan recognition protein (PGRP) gene in bay scallop (Argopecten irradians, Lamarck 1819). Dev. Comp. Immunol. 31, 548-558.

Pairett, A.N., Serb, J.M., 2013. De novo assembly and characterization of two transcriptomes reveal multiple light-mediated functions in the Scallop Eye (Bivalvia: Pectinidae). PLoS ONE 8, e69852.

Park, S.Y., Patnaik, B.B., Kang, S.W., Hwang, H-J., Chung, J.M., Song, D.K., Sang, M.K., Patnaik, H.H., Lee, J.B., Noh, M.Y., Kim, C., Kim, S., Park, H.S., Lee, J.S., Han, Y.S., Lee, Y.S., 2016. Transcriptomic analysis of the endangered neritid species Clithon retropictus: De novo assembly, functional annotation and marker discovery. Genes 7, 35.

Patnaik, B.B., Hwang, H.J., Kang, S.W., Park, S.Y., Wang, T.H., Park, E.B., Chung, J.M., Song, D.K., Kim, C., Kim, S., Lee, J.B., Jeong, H.C., Park, H.S., Han, Y.S., Lee, Y.S., 2015. Transcriptome characterization for Non-Model Endangered Lycaenids, Protantigius superans and Spindasis takanosis, using Illumina HiSeq 2500 Sequencing. Int. J. Mol. Sci. 16, 29948-29970.

Patnaik, B.B., Park, S.Y., Kang, S.W., Hwang, H.J., Wang, T.H., Park, E.B., Chung, J.M., Song, D.K., Kim, C., Kim, S., Lee, J.B., Jeong, H.C., Park, H.S., Han, Y.S., Lee, Y.S., 2016a. Transcriptome profile of the Asian Giant Hornet (Vespa mandarinia) using Illumina HiSeq 4000 Sequencing: De novo assembly, functional annotation, and discovery of SSR markers. Int. J. Genomics Article ID 4169587.

Patnaik, B.B., Wang, T.H., Kang, S.W., Hwang, H.J., Park, S.Y., Park, E.B., Chung, J.M., Song, D.K., Kim, C., Kim, S., Lee, J.S., Han, Y.S., Park, H.S., Lee, Y.S., 2016b. Sequencing, de novo assembly, and annotation of the transcriptome of the endangered freshwater pearl bivalve, Cristaria plicata, provides novel insights into functional genes and marker discovery. PLoS ONE (Accepted publication), doi: 10.1371/journal.pone.0148622.

Perez-Portela, R., Turon, X., Riesgo, A., 2015. Characterization of the transcriptome and gene expression of four different tissues in the ecologically relevant sea urchin Arbacia lixula using RNA-seq. Mol. Ecol. Res. Dec. 9.

Pertea, G., Huang, X., Liang, F., Antonescu, V., Sultana, R., Karamycheva, S., Lee, Y., White, J., Cheung, F., Parvizi, B., Tsai, J., Quackenbush, J., 2003. TIGR Gene Indices Clustering Tool (TGICL): a software system for fast clustering of large EST datasets. Bioinformatics 19, 651-652.

Prentis, P.J., Pavasovic, A., 2014. The Anadara trapezia transcriptome: a resource for molluscan physiological genomics. Mar. Genomics Pt B, 113-115.

Puinean, A.M., Rotchell, J.M., 2006. Vitellogenin gene expression as a biomarker of endocrine disruption in the invertebrate, Mytilus edulis. Mar. Environ. Res. 62 Suppl. S211-214.

Qin, Z., Li, Y., Sun, D., Shao, M., Zhang, Z., 2012. Cloning and expression analysis of the vitellogenin gene in the scallop Chlamys farreri and the effects of estradiol-17β on its synthesis. Invertebr. Biol. 131, 312-321.

Qiu, L., Song, L., Xu, W., Ni, D., Yu, Y., 2007. Molecular cloning and expression of a Toll receptor gene homologue from Zhikong scallop, Chlamys farreri. Fish Shellfish Immunol. 22, 451-466.

Quevillon, E., Silventoinen, V., Pillai, S., Harte, N., Mulder, N., Apweiler, R., Lopez, R., 2005. InterProScan: protein domains identifier. Nucleic Acids Res. 33, W116-120.

Renault, T., Faury, N., Barbosa-Solomieu, V., Moreau, K., 2011. Suppression subtractive hybridization (SSH) and real time PCR reveal differential gene expression in the Pacific cupped oyster, Crassostrea gigas, challenged with Ostreid herpesvirus 1. Dev. Comp. Immunol. 35, 725-735.

Rhee, S.Y., Wood, V., Dolinski, K., Draghici, S., 2008. Use and misuse of the gene ontology annotations. Nat. Rev. Genet. 9, 509-515.

Richardson, M.F., Sherman, C.D.H., 2015. De novo assembly and characterization of the invasive Northern Pacific Seastar transcriptome. PLoS ONE 10, e0142003.

Rogers, M.F., Ben-Hur, A., 2009. The use of gene ontology evidence codes in preventing classifier assessment bias. Bioinformatics 25, 1173-1177.

Shaw, T.I., Srivastava, A., Chou, W.C., Liu, L., Hawkinson, A., Glenn, T.C., Adams, R., Schountz, T., 2012. Transcriptome sequencing and annotation for the Jamaican Fruit Bat (Artibeus jamaicensis). PLoS ONE 7, e48472.

Simakov, O., Larsson, T.A., Arendt, D., 2013. Linking micro- and macro-evolution at the cell type level: a view from the lophotrochozoan Platynereis dumerilii. Brief. Funct. Genomics 12, 430-439.

Su, J., Ni, D., Song, L., Zhao, J., Qiu, L., 2007. Molecular cloning and characterization of a short type peptidoglycan recognition protein (CfPGRP-S1) cDNA from Zhikong scallop Chlamys farreri. Fish Shellfish Immunol. 23, 646-656.

Suzuki, T., Hara, A., Yamaguchi, K., Mori, K., 1992. Purification and immunolocalization of a vitelline-like protein from the Pacific Oyster Crassostrea gigas. Mar. Biol. 11, 3239-3245.

Tanguy, A., Guo, X., Ford, S.E., 2004. Discovery of genes expressed in response to Perkinsus marinus challenge in Eastern (Crassostrea virginica) and Pacific (C. gigas) oysters. Gene 338, 121-131.

Tasumi, S., Vasta, G.R., 2007. A galectin of unique domain organization from hemocytes of the Eastern Oyster (Crassostrea virginica) is a receptor for the protistan parasite Perkinsus marinus. J. Immunol. 179, 3086-3098.

Tatewaki, R., Kitada, J., 1987. Karyological studies of five species of land snails (Helicoidea: Mollusca). Genetica 74, 73-80.

Teaniniuraitemoana, V., Huvet, A., Levy, P., Klopp, C., Lhuillier, E., Gaertner-Mazouni, N., Gueguen, Y., Le Moullac, G., 2014. Gonad transcriptome analysis of pearl oyster Pinctada margaritifera: identification of potential sex differentiation and sex determining genes. BMC Genomics 15, 491.

Tong, Y., Zhang, Y., Huang, J., Xiao, S., Zhang, Y., Li, J., Chen, J., Yu, Z., 2015. Transcriptomics analysis of Crassostrea hongkongensis for the discovery of reproduction-related genes. PLoS ONE 10, e0134280.

Uliano-Silva, M., Americo, J.A., Brindeiro, R., Dondero, F., Prosdocimi, F., de Freitas Rebelo, M., 2014. Gene discovery through transcriptome sequencing for the invasive mussel Limnoperna fortunei. PLoS ONE 9, e102973.

Wang, L., Wang, L., Huang, M., Zhang, H., Song, L., 2011. The immune role of C-type lectins in molluscs. ISJ 8, 241-246.

Wang, W., Hui, J.H.L., Chan, T.F., Chu, K.H., 2014. De novo transcriptome sequencing of the snail Echinolittorina malaccana: Identification of genes responsive to thermal stress and development of genetic markers for population studies. Mar. Biotechnol. 16, 547-559.

Wu, X., Xiong, X., Xie, L., Zhang, R., 2007. Pf-Rel, a Rel/Nuclear factor-κB homolog identified from the Pearl Oyster, Pinctada fucata. Acta Biochim. Et Biophys. Sinica 39, 533-539.

You, F.M., Huo, N., Gu, Y.Q., Luo, M.C., Ma, Y., Hane, D., Lazo, G.R., Dvorak, J., Anderson,

O.D., 2008. BatchPrimer3: A high throughput web application for PCR and sequencing

primer design. 9, 253.

Yue, H., Li, C., Du, H., Zhang, S., Wei, Q., 2015. Sequencing and de novo assembly of the

Gonadal transcriptome of the endangered Chinese Sturgeon (Acipenser sinensis). PLoS

ONE 10, e0127332.

Zeng, D., Chen, X., Xie, D., Zhao, Y., Yang, C., Li, Y., Ma, N., Peng, M., Yang, Q., Liao, Z.,

Wang, H., Chen, X., 2013. Transcriptome analysis of Pacific White Shrimp (Litopenaeus

vannamei) hepatopancreas in response to Taura Syndrome Virus (TSV) experimental

infection. PLoS ONE 8, e57515.

Zhang, G., Fang, X., Guo, X., Li, L., Luo, R., Xu, F., Yang, P., Zhang, L., Wang, X., Qi, H.,

Xiong, Z., Que, H., Xie, Y., Holland, P.W.H., Paps, J., Zhu, Y., Wu, F., Chen, Y., Wang,

J., Peng, C., Meng, J., Yang, L., Liu, J., Wen, B., Zhang, N. et al., 2012. The oyster

genome reveals stress adaptation and complexity of shell formation. Nat. 490, 49-54.

Zhang, L., Li, L., Zhu, Y., Zhang, G., Guo, X., 2014. Transcriptome analysis reveals a rich gene set related to innate immunity in the Eastern Oyster (Crassostrea virginica). Mar. Biotechnol. 16, 17-33.

Zhang, Y., He, X., Yu, F., Xiang, Z., Li, J., Thorpe, K.L., Yu, Z., 2013. Characteristic and

functional analysis of Toll-like receptors (TLRs) in the lophotrochozoan, Crassostrea

gigas, reveals ancient origin of TLR-mediated innate immunity. PLoS ONE 8, e76464.

Zhao, J., Song, L., Li, C., Zou, H., Ni, D., Wang, W., Xu, W., 2007. Molecular cloning of an

invertebrate goose-type lysozyme gene from Chlamys farreri, and lytic activity of the

recombinant protein. Mol. Immunol. 44, 1198-1208.

Zheng, H., Zhang, Q., Liu, H., Liu, W., Sun, Z., Li, S., Zhang, T., 2012. Cloning and expression

of vitellogenin (Vg) gene and its correlations with total carotenoids content and total

antioxidant capacity in noble scallop Chlamys nobilis (Bivalve: Pectinidae). Aquaculture

366-367, 46-53.

Figure legends

Figure 1. The length distribution of contigs (A) and unigenes (B) from the de novo assembly of the S. myomphala transcriptome. The 'x-axis' indicates the length of contigs or unigenes and the 'y-axis' indicates the number of contigs or unigenes.

Figure 2. A three-way Venn diagram depicting the unique and overlapped unigenes showing homology to sequences in PANM, Unigene, and COG databases. The BLASTX and BLASTN programs at an E-value cutoff of 1e-5 were used for the unigene sequence annotation against PANM/COG and Unigene databases, respectively.

Figure 3. Homology analysis of BLASTX hits against PANM database. (A) E-value distribution, (B) Identity distribution, (C) Similarity distribution, and (D) the ratio of unigene hits to non-hits.

Figure 4. Top-hit species distribution based on BLASTX annotation against PANM database.

Figure 5. Clusters of Orthologous Groups (COG) classifications of S. myomphala unigenes to 25 functional categories excluding the 'multi' category. The COG classifications are on the 'x-axis', denoted by capital letters, and the number of unigenes is on the 'y-axis'.

Figure 6. The suggested classification of S. myomphala unigenes to GO term functional categories. (A) A three-way Venn diagram depicting the assignment of unigenes to 'biological process', 'cellular component', and 'molecular function' categories. (B) Number of unigenes ascribed to number of GO terms.

Figure 7. GO classifications of unigenes at level 2. (A) Biological process category, (B) Cellular component category, and (C) Molecular function category. The 'x-axis' indicates the top-10 GO terms and the 'y-axis' indicates the number of unigenes.

Figure 8. Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis. The S. myomphala visceral mass unigenes were assigned to KEGG pathways (outer circle). The sequences showing annotation to enzymes within each KEGG pathway are shown in the inner circle. Each pathway is represented in a different color.

Figure 9. The SSR repeat motif types in the unigenes (>1 kb sequences) of S. myomphala transcriptome. The 'x-axis' represents the repeat motif types and the 'y-axis' the number of unigenes.

Table 1. Summary of Illumina 2500 sequencing, assembly and analysis of S. myomphala transcriptomic sequences.

| Category | Parameter | Value |
|---|---|---|
| Raw reads | Number of sequences | 264,452,328 |
| | Number of bases | 33,320,993,328 |
| Clean reads | Number of sequences | 260,925,235 |
| | Number of bases | 32,366,103,440 |
| | High-quality reads (%) | 98.67 (sequences), 97.13 (bases) |
| Contig assembly | Total number of contigs | 149,251 |
| | Number of bases | 78,229,451 |
| | Mean length of contig (bp) | 524.1 |
| | N50 length of contig (bp) | 606 |
| | GC % of contig | 42.90 |
| | Largest contig (bp) | 11,186 |
| | No. of large contigs (≥500 bp) | 45,726 |
| Unigene assembly | Total number of unigenes | 103,774 |
| | Number of bases | 59,325,217 |
| | Mean length of unigene (bp) | 571.7 |
| | N50 length of unigene (bp) | 705 |
| | GC % of unigene | 42.86 |
| | Length ranges (bp) | 133-13,172 |
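
The mean length and N50 values in Table 1 follow standard definitions. As a minimal sketch (the function names `n50` and `mean_length` and the toy length list are illustrative assumptions, not the authors' pipeline), N50 is the length of the sequence at which the running total of lengths, sorted from longest to shortest, first reaches half of all assembled bases. The mean lengths can be rechecked directly from the totals reported in the table.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no lengths given")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # First sequence at which the cumulative length covers half the assembly
        if running >= half_total:
            return length


def mean_length(total_bases, n_sequences):
    """Mean sequence length from total bases and sequence count."""
    return total_bases / n_sequences


# Toy example (hypothetical lengths, not data from the study)
toy_n50 = n50([2, 3, 4, 5, 6])

# Totals taken from Table 1
contig_mean = mean_length(78_229_451, 149_251)
unigene_mean = mean_length(59_325_217, 103_774)
print(toy_n50, round(contig_mean, 1), round(unigene_mean, 1))
```

The contig and unigene N50 values themselves (606 and 705 bp) cannot be recomputed from the table alone, since they depend on the full length distribution shown in Figure 1.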

Table 2. Annotation of unigenes of the S. myomphala transcriptome against public databases using the BLASTX and BLASTN programs.

| Databases | All annotated unigenes | ≤300 (bp) | 300-1000 (bp) | ≥1000 (bp) |
|---|---|---|---|---|
| PANM-DB | 37,959 | 8620 | 20,268 | 9071 |
| Unigene | 12,890 | 2431 | 6707 | 3752 |
| COG | 17,699 | 2756 | 8749 | 6194 |
| GO | 14,451 | 2799 | 7165 | 4487 |
| KEGG | 1453 | 255 | 683 | 515 |
| ALL | 41,245 | 9470 | 22,316 | 9459 |

Table 3. Summary of top 25 protein domains predicted in S. myomphala unigene sequences.

| IPR accession | Domain name | Domain description | Unigenes |
|---|---|---|---|
| IPR015880 | Znf_C2H2 | Zinc finger, C2H2-like domain | 518 |
| IPR027417 | P-loop_NTPase | P-loop containing nucleoside triphosphate hydrolase domain | 230 |
| IPR000477 | RT_dom | Reverse transcriptase domain | 203 |
| IPR012336 | Thioredoxin-like_fold | Thioredoxin-like fold domain | 141 |
| IPR013087 | Znf_C2H2/integrase_DNA-bd | Zinc finger C2H2-type/integrase DNA-binding domain | 134 |
| IPR000504 | RRM_dom | RNA recognition motif domain | 131 |
| IPR005135 | Endo/exonuclease/phosphatase | Endonuclease/exonuclease/phosphatase domain | 121 |
| IPR002110 | Ankyrin_rpt | Ankyrin repeat | 117 |
| IPR012337 | RNaseH-like_dom | Ribonuclease H-like domain | 112 |
| IPR001680 | WD40_repeat | WD40 repeat | 109 |
| IPR003591 | Leu-rich_rpt_typical-subtyp | Leucine-rich repeat, typical subtype repeat | 109 |
| IPR001304 | C-type_lectin | C-type lectin domain | 100 |
| IPR013783 | Ig-like_fold | Immunoglobulin-like fold domain | 97 |
| IPR011989 | ARM-like | Armadillo-like helical domain | 95 |
| IPR002048 | EF_hand_dom | EF-hand domain | 92 |
| IPR002290 | Ser/Thr_dual-sp_kinase | Serine/threonine/dual specificity protein kinase, catalytic domain | 90 |
| IPR000719 | Prot_kinase_dom | Protein kinase domain | 82 |
| IPR011701 | MFS | Major facilitator superfamily | 80 |
| IPR000276 | GPCR_Rhodpsn | G protein-coupled receptor, rhodopsin-like family | 79 |
| IPR001611 | Leu-rich_rpt | Leucine-rich repeat | 78 |
| IPR029058 | AB_hydrolase | Alpha/Beta hydrolase fold domain | 77 |
| IPR002126 | Cadherin | Cadherin domain | 76 |
| IPR001314 | Peptidase_S1A | Peptidase S1A, chymotrypsin-type family | 68 |
| IPR002035 | VWF_A | von Willebrand factor, type A domain | 64 |
| IPR011992 | EF-hand-dom_pair | EF-hand domain pair domain | 63 |

Table 4. Genes of interest for immunity and defense in S. myomphala unigene sequences.

| Protein family | Number of unigenes | Note |
|---|---|---|
| Toll-like receptor | 32 | Includes Toll-like receptor 2, -3, -4, -6, -7, -13; Toll-interacting protein |
| Scavenger receptor | 6 | Includes Scavenger receptor class B, -class F; endothelial cells scavenger receptor |
| CD | 3 | Includes CD63 |
| PGRP | 3 | Includes PGRP 1-like, PGRP-short form |
| Lectins | 24 | Includes C-type lectin-2, -3, -4, -8; lectin-C; endoplasmic reticulum lectin-1; mannan-binding lectin; sialic acid binding lectin-2, -3; D-galacturonic acid binding lectin precursor |
| Galectins | 15 | Includes tandem repeat galectin; galectin-3 like isoform; galectin-4 like isoform |
| FBG-domain-containing protein | 12 | Includes fibrinogen-related protein-1, -11, -12.1, -14.1 |
| Thioester-containing protein | 26 | Includes thioester containing protein |
| C1q domain protein | 23 | Includes complement domain protein; complement C1q-like protein 2; complement C1q TNF-related protein -3, -4, -9A; complement C1q-like protein 4 |
| Lysozyme | 10 | Includes G-type lysozyme |
| Serpin | 39 | Includes serpin B3-like, -B4-like, -B6-like |
| NOS | 2 | Includes nitric oxide synthase-interacting protein-like |
| Catalase | 6 | Includes catalase |
| Superoxide dismutase | 7 | Includes superoxide dismutase (Cu/Zn); superoxide dismutase Mn, mitochondrial-like |
| Glutathione peroxidase | 8 | Includes glutathione peroxidase; epididymal secretory glutathione peroxidase-like |
| Glutathione-S-transferase | 47 | Includes glutathione-S-transferase theta and omega |
| Glutathione synthetase | 2 | Includes glutathione synthetase |
| Cactus/IκB | 2 | Includes NFκB inhibitor cactus-like |
| Relish | 2 | Includes relish |
| TBK | 5 | Includes serine/threonine-protein kinase TBK1-like |
| TRAF | 46 | Includes TRAF-3 interacting protein 1 |
| TRIAD | 5 | Includes histidine triad proteins |
| p38 | 2 | Includes p38 mitogen activated protein kinase |
| AP-1 | 9 | Includes AP-1 complex subunit protein |
| MyD88 | 2 | Includes myeloid differentiation primary response protein MyD88-like |
| TNF | 24 | Includes TNF receptor associated factor-2, -3, -4, -5, -6 like; lipopolysaccharide induced TNF factor 2-b |
| Caspase | 13 | Includes caspase 1, -2, -3, -6, -7, -8, -10 |
| Bcl-2 | 9 | Includes apoptosis regulator Bcl-2 |
| Bax | 3 | Includes apoptosis regulator Bax |
| HSP | 30 | Includes HSP-10, -20, -40, -60, -70, -75, -90; heat shock cognate 3; small HSP-36 |

Table 5. Reproduction-related protein families predicted for the unigenes of S. myomphala.

| Protein family | Unigene ID |
|---|---|
| Sperm protamine P1-like | Sm_Uni_101535 |
| Asparagine rich candidate spermatophore protein | Sm_Uni_090887 |
| Spermatid-specific protein T2-like | Sm_Uni_001679; Sm_Uni_003858; Sm_Uni_057032 |
| Spermatogenesis associated protein-1 | Sm_Uni_009048 |
| Spermatogenesis associated protein-5-like | Sm_Uni_012313; Sm_Uni_014437; Sm_Uni_060906 |
| Spermatogenesis associated protein-6-like | Sm_Uni_070097 |
| Spermatogenesis associated protein-4-like | Sm_Uni_020334; Sm_Uni_110402 |
| Spermatogenesis associated protein-7-like | Sm_Uni_033301; Sm_Uni_037712; Sm_Uni_050634; Sm_Uni_050635; Sm_Uni_067762; Sm_Uni_080018 |
| Spermatogenesis associated protein-24-like | Sm_Uni_035667; Sm_Uni_039412 |
| Spermatogenesis associated protein-22-like | Sm_Uni_045241; Sm_Uni_045242; Sm_Uni_045243 |
| Spermatogenesis associated protein-20-like | Sm_Uni_070892; Sm_Uni_091052 |
| Spermatogenesis associated protein-17-like | Sm_Uni_090294 |
| Spermatozoon associated kinase | Sm_Uni_012403 |
| Sperm flagellar protein-like | Sm_Uni_012457; Sm_Uni_047712; Sm_Uni_075966; Sm_Uni_090182 |
| Spermine oxidase-like | Sm_Uni_012748; Sm_Uni_017134; Sm_Uni_018512; Sm_Uni_054565; Sm_Uni_065003; Sm_Uni_078657; Sm_Uni_090233 |
| Spermidine synthase-like | Sm_Uni_013517; Sm_Uni_030685 |
| Spermatogenesis defective protein 39 | Sm_Uni_014962; Sm_Uni_030038; Sm_Uni_030039 |
| Motile sperm domain containing protein | Sm_Uni_017572; Sm_Uni_059069; Sm_Uni_104647; Sm_Uni_107534 |
| Sperm protein | Sm_Uni_018504 |
| Sperm associated antigen 6-like | Sm_Uni_047327; Sm_Uni_051656 |
| Sperm associated antigen 16 protein-like | Sm_Uni_020948 |
| Sperm associated antigen 8-like | Sm_Uni_037748; Sm_Uni_047999 |
| Sperm associated antigen 17-like | Sm_Uni_063779; Sm_Uni_085263 |
| Sperm associated antigen 1-like | Sm_Uni_102940; Sm_Uni_103842 |
| Sperm tail PG-rich repeat containing protein-2 | Sm_Uni_038879 |
| Nuclear autoantigenic sperm protein-like | Sm_Uni_040379; Sm_Uni_074486; Sm_Uni_074487 |
| Sperm specific protein PHI 2B/PHI3-like | Sm_Uni_040518 |
| Prostatic spermine binding protein | Sm_Uni_055146 |
| Spermatid perinuclear RNA-binding protein | Sm_Uni_057679 |
| Oocyte zinc-finger protein | Sm_Uni_063399; Sm_Uni_079333 |
| Vitellogenin-1 | Sm_Uni_021344; Sm_Uni_041083; Sm_Uni_054701; Sm_Uni_091395 |
| Vitellogenin receptor | Sm_Uni_095099 |
| Vitelline envelope zona pellucida domain 14 | Sm_Uni_007430; Sm_Uni_107614 |

Table 6. Simple sequence repeat (SSR) types in the Satsuma myomphala transcriptome.

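Columns 4 to ≥21 give the number of repeats of the motif.

| Motif length | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | ≥21 | Number* | %** |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Di | 0 | 0 | 484 | 287 | 162 | 154 | 68 | 96 | 77 | 19 | 19 | 21 | 18 | 16 | 20 | 22 | 11 | 154 | 1628 | 52.18 |
| Tri | 0 | 560 | 196 | 92 | 82 | 11 | 16 | 7 | 8 | 5 | 15 | 4 | 2 | 9 | 5 | 4 | 10 | 41 | 1067 | 34.2 |
| Tetra | 206 | 67 | 37 | 4 | 7 | 1 | 1 | 5 | 0 | 3 | 2 | 4 | 5 | 1 | 0 | 1 | 0 | 10 | 354 | 11.35 |
| Penta | 46 | 7 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 56 | 1.79 |
| Hexa | 12 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15 | 0.48 |
| Total | 264 | 634 | 720 | 385 | 251 | 166 | 85 | 108 | 85 | 27 | 37 | 29 | 25 | 26 | 25 | 27 | 21 | 205 | 3120 | 100.00 |

*Number of SSRs detected in the unigenes (>1 kb); **Relative percent of SSRs with different motif lengths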
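
The kind of tally shown in Table 6 can be illustrated with a small regular-expression scan for perfect SSRs. This is a hedged sketch only: the minimum repeat counts (six for di-, five for tri-, four for tetra-, penta- and hexanucleotide motifs) are inferred from where the nonzero counts begin in the table and are an assumption, the function name `find_ssrs` and the test sequence are hypothetical, and the code is not the SSR detection tool used in the study.

```python
import re
from collections import Counter

# Minimum repeat count per motif length (assumption inferred from Table 6)
MIN_REPEATS = {2: 6, 3: 5, 4: 4, 5: 4, 6: 4}
NAMES = {2: "Di", 3: "Tri", 4: "Tetra", 5: "Penta", 6: "Hexa"}


def find_ssrs(seq):
    """Return (motif, repeat_count, start) tuples for perfect SSRs in seq."""
    seq = seq.upper()
    hits = []
    for k, n in sorted(MIN_REPEATS.items()):
        # A k-mer followed by at least n-1 further copies of itself
        pattern = re.compile(r"(([ACGT]{%d})\2{%d,})" % (k, n - 1))
        for m in pattern.finditer(seq):
            motif = m.group(2)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "ATAT")
            if any(motif == motif[:d] * (k // d) for d in range(1, k) if k % d == 0):
                continue
            hits.append((motif, len(m.group(1)) // k, m.start()))
    return hits


# Hypothetical sequence: (AT)7 and (CAG)5 embedded in flanking bases
example = "GG" + "AT" * 7 + "CCGT" + "CAG" * 5 + "TT"
ssrs = find_ssrs(example)
by_type = Counter(NAMES[len(motif)] for motif, _, _ in ssrs)
print(ssrs, dict(by_type))
```

Applying such a scan to every unigene longer than 1 kb and binning hits by motif length and repeat count would give a table of the same shape as Table 6, although exact counts depend on the tool and its settings.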

Table S1. Preprocessing of raw read sequences of S. myomphala using the Cutadapt program with default parameters for paired-end reads.

| Data processing | Parameters for S. myomphala |
|---|---|
| Raw reads | |
| - Number of sequences | 264,452,328 |
| - Number of bases | 33,320,993,328 |
| Total read pairs processed | 132,226,164 |
| - Read 1 with adapter | 4,587,307 (3.5%) |
| - Read 2 with adapter | 6,293,088 (4.8%) |
| Pairs written (passing filters) | 132,226,164 |
| Total base pairs processed (bp) | 33,320,993,328 |
| - Read 1 (bp) | 16,660,496,664 |
| - Read 2 (bp) | 16,660,496,664 |
| Total written (filtered) (bp) | 33,265,089,270 |
| - Read 1 (bp) | 16,635,365,094 |
| - Read 2 (bp) | 16,629,724,176 |
| % of reads after trimming | 99.83 |
| % of reads discarded | 0.17 |
| Average length after trimming (bp) | 125.8 |

Adapter 1 sequence: AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC

Adapter 2 sequence: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
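
The percentages in Table S1 can be rechecked from the base counts. The sketch below assumes, as the numbers suggest, that the "% of reads after trimming" row is computed on bases rather than on read counts (all 132,226,164 pairs were written), and that the average length is total written bases divided by the number of reads. The variable names are illustrative.

```python
# Values taken from Table S1
raw_bases = 33_320_993_328
written_bases = 33_265_089_270
n_reads = 264_452_328

# Share of bases kept and removed by adapter trimming (assumed base-level ratio)
pct_kept = 100 * written_bases / raw_bases
pct_removed = 100 - pct_kept

# Read length before trimming and mean read length after trimming
raw_read_length = raw_bases / n_reads
mean_trimmed_length = written_bases / n_reads

print(round(pct_kept, 2), round(pct_removed, 2), round(mean_trimmed_length, 1))
```

Under these assumptions the computed values agree with the table (99.83, 0.17 and 125.8), and the raw read length works out to exactly 126 bp.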
