Article Transcriptome Characterization for Non-Model Endangered Lycaenids, Protantigius superans and Spindasis takanosis, Using Illumina HiSeq 2500 Sequencing

Bharat Bhusan Patnaik 1,2,†, Hee-Ju Hwang 1,†, Se Won Kang 1, So Young Park 1, Tae Hun Wang 1, Eun Bi Park 1, Jong Min Chung 1, Dae Kwon Song 1, Changmu Kim 3, Soonok Kim 3, Jae Bong Lee 4, Heon Cheon Jeong 5, Hong Seog Park 6, Yeon Soo Han 7 and Yong Seok Lee 1,*

Received: 7 October 2015; Accepted: 9 December 2015; Published: 16 December 2015 Academic Editor: Lee A. Bulla

1 Department of Life Science and Biotechnology, College of Natural Sciences, Soonchunhyang University, 22 Soonchunhyangro, Shinchang-myeon, Asan, Chungcheongnam-do 31538, Korea; [email protected] (B.B.P.); [email protected] (H.-J.H.); [email protected] (S.W.K.); [email protected] (S.Y.P.); [email protected] (T.H.W.); [email protected] (E.B.P.); [email protected] (J.M.C.); [email protected] (D.K.S.)
2 Trident School of Biotech Sciences, Trident Academy of Creative Technology (TACT), Chandaka Industrial Estate, Chandrasekharpur, Bhubaneswar, Odisha 751024, India
3 National Institute of Biological Resources, 42, Hwangyeong-ro, Seo-gu, Incheon 22689, Korea; [email protected] (C.K.); [email protected] (S.K.)
4 Korea Zoonosis Research Institute (KOZRI), Chonbuk National University, 820-120 Hana-ro, Iksan, Jeollabuk-do 54528, Korea; [email protected]
5 Hampyeong County Institute, Hampyeong County Agricultural Technology Center, 90, Hakgyohwasan-gil, Hakgyo-myeon, Hampyeong-gun, Jeollanam-do 57158, Korea; [email protected]
6 Research Institute, GnC BIO Co., LTD. 621-6 Banseok-dong, Yuseong-gu, Daejeon 34069, Korea; [email protected]
7 College of Agriculture and Life Science, Chonnam National University 77 Yongbong-ro, Buk-gu, Gwangju 61186, Korea; [email protected]
* Correspondence: [email protected]; Tel.: +82-10-4727-5524; Fax: +82-41-530-1256
† These authors contributed equally to this work.

Abstract: The butterflies Protantigius superans and Spindasis takanosis are endangered species in Korea, known for their symbiotic association with ants. However, the genomic and transcriptomic data needed for these species are lacking, which limits conservation efforts. In this study, the P. superans and S. takanosis transcriptomes were deciphered using Illumina HiSeq 2500 sequencing. The P. superans and S. takanosis transcriptome data included a total of 254,340,693 and 245,110,582 clean reads assembled into 159,074 and 170,449 contigs and 107,950 and 121,140 unigenes, respectively. BLASTX hits (E-value of 1.0 × 10⁻⁵) against the known protein databases annotated a total of 46,754 and 51,908 transcripts for P. superans and S. takanosis, respectively. Approximately 41.25% and 38.68% of the unigenes for P. superans and S. takanosis found homologous sequences in Protostome DB (PANM-DB). BLAST2GO analysis confirmed 18,661 unigenes representing Gene Ontology (GO) terms and a total of 5259 unigenes assigned to 116 pathways for P. superans. For S. takanosis, a total of 6697 unigenes were assigned to 119 pathways using the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database. Additionally, 382,164 and 390,516 Simple Sequence Repeats (SSRs) were compiled from the unigenes of P. superans and S. takanosis, respectively. This is the first report to record new genes for these species; the data can be used for the conservation of lycaenid populations and as reference information for closely related species.

Int. J. Mol. Sci. 2015, 16, 29948–29970; doi:10.3390/ijms161226213 www.mdpi.com/journal/ijms

Keywords: Protantigius superans; Spindasis takanosis; endangered species; transcriptome; Illumina sequencing; BLAST2GO; SSRs (simple sequence repeats)

1. Introduction

Butterflies form an integral part of the Earth’s rich biodiversity and are considered quality-of-life indicators. They have aesthetic value, are part of our natural heritage, and are portrayed as symbols of beauty, peace or freedom. They are also of immense scientific value and have been used in diverse areas of biological research including pest control, population dynamics, biodiversity conservation, evolution, and genetics. Lately, they have been studied in the context of climate change and global warming. This is not surprising, as populations of many butterfly species have declined and many show substantial changes in their distribution. The primary causes of the vulnerability of butterfly species include damage to habitats due to human activities, agricultural activities and air pollution [1,2]. The consistent decline in butterfly populations has prompted ecologists to engage in species and community conservation initiatives that explore themes such as the evolutionary origins of butterfly diversity, population dynamics and threats (parasitism, predation and human impacts). Other scientific strategies used for the conservation assessment of butterfly populations around the globe draw on sequence information, that is, whole-genome or transcriptome sequencing to characterize genes under positive selection in the species. In the context of the declining butterfly populations, a Red List assessment is imperative as it addresses the likelihood of a species becoming extinct in its natural environment and hence promotes the prioritization and strict assessment of conservation programs [1]. Red Lists of butterflies have been published on different scales and from different countries following the guidelines of the International Union for Conservation of Nature and Natural Resources (IUCN) [1,3,4].
In the Korean peninsula, an exhaustive investigation of the endangered butterfly populations was conducted to identify critically endangered, endangered, and vulnerable species and to understand the causes of decline over time [5]. Subsequently, a Red List assessment was conducted in Korea using the IUCN Red List Categories and Criteria (version 3.1) and IUCN Red List Categories and Criteria (version 8.0), which classified the status of animal and plant species into several categories [6]. From these assessments, we found that a majority of threatened butterflies (classified as Critically Endangered, Endangered, and Vulnerable) belonged to the families Lycaenidae and Nymphalidae of the order Lepidoptera. The family Lycaenidae (gossamer-winged butterflies) represents around 40% of all known butterfly species. Most species of the family (about 75%) show a mutualistic or commensal relationship with ants in nature, but a few have evolved into parasites of ants [7,8]. Many such ant-parasitic Lycaenid species are highly endangered and are a prime focus of insect conservation biology [9]. A total of 15 Lycaenid species have been included in the Korean Red List of Threatened Species (2014). The populations of these Lycaenid species have declined due to climate change and the loss of symbiotic ants. The extinction of a host ant species is detrimental to the survival of the associated ant-parasitic Lycaenid butterflies; hence conservation of the former is also a priority for protecting the latter in the Korean peninsula [10]. The Lycaenidae species Spindasis takanosis and Protantigius superans were listed as endangered in Korea by a Ministry of Environment report in 2005 [11]. These Lycaenids have been classified as Level 2 species and are likely to face extinction unless the threatening factors are eliminated or mitigated.
Spindasis takanosis (Matsumura) has been reported from the Gyunggi-do, Gangwon-do, Chungchungnam-do and Jeollanam-do provinces of Korea. The population of this species has declined alarmingly due to the loss of forests and of symbiotic ants such as Crematogaster matsumurai [12]. S. takanosis belongs to the non-parasitic majority of Lycaenids and is a symbiont of C. matsumurai [8]. Protantigius superans was not included as a threatened butterfly species in an earlier study [5], but it was included as a vulnerable species by the Korean Red List assessment (2014). The ant-parasitism status of P. superans has not been reported. In fact, there are only scattered occurrences of ant-parasitism among the Lycaenidae species, suggesting an independent evolution of such interactions. The major barrier to conservation efforts, even after the enactment of the Korean Endangered Species Act in 2005, was the lack of data on the genetics, ecology and distribution of the species. As an initiative towards preserving the gene pool of the Lycaenid species S. takanosis and P. superans, generating genomic and RNA sequence information was considered a sound strategy, more so after the completion of mitogenome sequencing for the species [13,14].

The need for conservation genomics initiatives in biodiversity assessment and management has been advocated, especially with the rapid evolution of Next-Generation Sequencing (NGS) platforms [15,16]. NGS enables the derivation of a global gene or RNA expression profile that can lead to the discovery of genetic markers such as Simple Sequence Repeats (SSRs) and Single Nucleotide Polymorphisms (SNPs), and of candidate transcripts as markers for ecological fitness, quantitative trait loci (QTL) and so on [17,18]. Transcriptome sequencing of butterfly species using NGS has provided clues towards understanding population genetic structure and setting conservation goals and priorities. One example concerns the rapidly declining populations of the marsh fritillary butterfly, Euphydryas aurinia, in which next-generation 454-pyrosequencing characterized seven novel microsatellite loci [19]. Other 454-sequencing data have established the transcriptome of the Glanville fritillary butterfly (Melitaea cinxia), with the discovery of a large number of SNPs [18].
The phylogeography of the Karner blue butterfly, belonging to the genus Lycaeides, using 454-pyrosequencing has also been reported [20]. Lately, Illumina sequencing has been extremely successful in the identification of SSRs in endangered specimens within and across taxonomic groups [21,22]. Most notably, the genome assembly of the Monarch butterfly involved the use of Illumina paired-end sequencing [23]. We have used the Illumina HiSeq 2500 NGS technology to characterize the transcriptome of endangered Lycaenids S. takanosis and P. superans and annotated the genomic resources from the species for the mechanistic dissection of ecologically relevant traits. This study also bridges the gap between the genomic sequence information of model organisms vs. non-model species that are essentially the viable targets of biodiversity conservation and phylogenetics.

2. Results and Discussion

2.1. Transcriptome Analysis

In order to obtain the transcriptomes of the endangered Lycaenid butterflies S. takanosis and P. superans, a cDNA library was constructed from RNA isolated from the whole body of adult insects and sequenced on the Illumina HiSeq 2500 NGS platform. The transcriptome assembly and analysis work-flow is depicted in Figure 1. The Illumina HiSeq 2500 sequencing of P. superans generated a total of 258,875,070 raw reads (32,618,258,820 bases) with a mean length of 126 bp. Filtering of reads based on quality parameters discarded 0.32% of bases, giving a paired-end profile of 32,514,410,974 bases with an average length of 125.6 bp. After stringent quality assessment, a total of 254,340,693 clean reads (Q20 ~99%, with 0% unknown nucleotides) were obtained, which represents 98.25% of the raw reads. The mean length and the N50 length of the clean reads were 124.3 bp and 126 bp, respectively, with a GC% (guanine-cytosine content) of 39.81%. In the case of S. takanosis, sequencing generated a total of 249,312,792 raw reads (31,413,411,792 bases) with a mean length of 126 bp. Adapter trimming discarded 0.35% of the total base pairs processed and, after stringent quality assessment, a total of 245,110,582 clean reads (30,515,812,866 bases) were obtained. The mean length, N50 length, and GC% of the clean reads were 124.5 bp, 126 bp, and 41.96%, respectively. The summary of the read processing analysis based on quality parameters is shown in Table S1. The sequence reads generated from the transcriptome sequencing of P. superans and S. takanosis have been submitted to the GenBank Sequence Read Archive (SRA) at the National Center for Biotechnology Information (NCBI) under accession numbers SRP063812 and SRP063813, respectively.
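The retention percentages quoted above follow directly from the raw and clean read counts. The short Python sketch below reproduces that arithmetic; it is only an illustration and not part of the original analysis pipeline, and the names `READS`, `percent_retained` and `retention_summary` are hypothetical.

```python
# Raw and clean read counts as reported in the text for each species.
READS = {
    "P. superans": {"raw": 258_875_070, "clean": 254_340_693},
    "S. takanosis": {"raw": 249_312_792, "clean": 245_110_582},
}


def percent_retained(raw: int, clean: int) -> float:
    """Share of raw reads that survive quality filtering, in percent."""
    return round(100.0 * clean / raw, 2)


def retention_summary(reads=READS) -> dict:
    """Map each species to its percentage of retained (clean) reads."""
    return {
        species: percent_retained(counts["raw"], counts["clean"])
        for species, counts in reads.items()
    }


if __name__ == "__main__":
    for species, pct in retention_summary().items():
        print(f"{species}: {pct:.2f}% of raw reads retained")
```

The same function can be applied to base counts instead of read counts to obtain the per-base retention figures listed in Table 1.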

Figure 1. Schematic work-flow followed for the transcriptome analysis of Lycaenid butterflies, Protantigius superans and Spindasis takanosis.

The processed high-quality reads were assembled using the Trinity program, which uses three sequential software modules, namely Inchworm, Chrysalis, and Butterfly, for de novo transcriptome assembly [24]. Trinity is designed specifically for assembling transcript sequences from Illumina transcriptome data and has been reported to outperform other de novo transcriptome assemblers, including SOAPdenovo-Trans [25], Trans-ABySS [26], and Oases [27]. With Trinity, a total of 159,074 contigs were assembled for the P. superans transcriptome, with an N50 (the contig length such that contigs of equal or greater length amount to half of the total assembly length) and mean length of 1220 bp and 746.3 bp, respectively. A significant proportion of the assembled contigs (39.28%) were ≥500 bp, with the longest contig being 15,152 bp. A total of 170,449 contigs were assembled for the S. takanosis transcriptome, with an N50 and mean length of 1372 bp and 786.4 bp, respectively. Out of the total assembled contigs, 66,844 contigs (39.21%) were ≥500 bp, with the longest contig being 16,820 bp.

The size distribution of the assembled contigs for the P. superans and S. takanosis transcriptomes is shown in Figure 2A. It is evident that the proportions of short contigs and of contigs over 1 kb were high in our datasets. Also, the contig N50 value was higher in the Lycaenid butterfly transcriptomes than the N50s obtained from transcriptome assemblies of other insects [28–30]. The contigs were finally clustered into a total of 107,950 unigenes with 89,022,313 bases for P. superans and 121,140 unigenes with 100,232,710 bases for S. takanosis. For P. superans, the N50 and mean length of unigenes were 1452 and 824.7 bp, respectively, with a GC% of 38.46%. Among these unigenes, 29,596 (27.42%) had a size of ≤300 bp, 54,518 (50.50%) were in the size range of 301–1000 bp, 13,325 (12.34%) were of lengths 1001–2000 bp, and 10,511 (9.74%) were over 2000 bp. For S. takanosis, the N50 and mean length of unigene sequences were 1537 and 827.4 bp, respectively, with a GC% of 38.68%. Among these unigenes, 35,101 (28.97%) had a size of ≤300 bp, 61,001 (50.36%) were in the size range of 301–1000 bp, 13,000 (10.73%) were of lengths 1001–2000 bp, and 12,038 (9.93%) were over 2000 bp. The size distribution of assembled unigenes for both Lycaenid butterflies is shown in Figure 2B. A summary of the P. superans and S. takanosis transcriptomes, covering the datasets obtained after the processing of raw reads, Trinity de novo assembly, and TIGR Gene Indices Clustering Tool (TGICL) clustering, is given in Table 1. The transcriptome sequencing and assembly efficiency was better in the case of S. takanosis, as it yielded a greater number of transcripts from a smaller number of raw reads.
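The N50 statistic defined above can be computed from a list of contig (or unigene) lengths. The following Python sketch is a minimal illustration of that definition; the function names and the toy length list are hypothetical and are not taken from the authors' pipeline.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a set of sequence lengths.

    Sequences are sorted from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches half of the
    total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one sequence length is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # not reached for non-negative lengths


def mean_length(lengths: Iterable[int]) -> float:
    """Arithmetic mean of the sequence lengths."""
    values = list(lengths)
    return sum(values) / len(values)


if __name__ == "__main__":
    toy_contigs = [200, 300, 500, 1000, 2000]  # hypothetical lengths in bp
    print("N50:", n50(toy_contigs), "bp")
    print("Mean:", mean_length(toy_contigs), "bp")
```

Because the N50 is weighted by sequence length, it is typically larger than the mean length when an assembly contains many short sequences, which is consistent with the N50 and mean values reported here.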

Table 1. Summary of transcriptome assembly after Illumina HiSeq 2500 sequencing of Lycaenid butterflies Spindasis takanosis and Protantigius superans.

Assembly Features                  Spindasis takanosis               Protantigius superans
Raw reads
  Number of sequences              249,312,792                       258,875,070
  Number of bases                  31,413,411,792                    32,618,258,820
  Mean length (bp)                 126                               126
Clean reads
  Number of sequences              245,110,582                       254,340,693
  Number of bases                  30,515,812,866                    31,607,701,940
  Mean length (bp)                 124.5                             124.3
  N50 length (bp)                  126                               126
  GC%                              41.96                             39.81
  High-quality reads (%)           98.31 (sequences), 97.14 (bases)  98.25 (sequences), 96.90 (bases)
  Number of reads discarded (%)    1.69 (sequences), 2.86 (bases)    1.75 (sequences), 3.10 (bases)
Contig information
  Total number of contigs          170,449                           159,074
  Number of bases                  134,036,728                       118,721,203
  Mean length of contig (bp)       786.4                             746.3
  N50 length of contig (bp)        1372                              1220
  GC% of contig                    38.58                             38.45
  Longest contig (bp)              16,820                            15,152
  No. of large contigs (≥500 bp)   66,844                            62,485
Unigene information
  Total number of unigenes         121,140                           107,950
  Number of bases                  100,232,710                       89,022,313
  Mean length of unigene (bp)      827.4                             824.7
  N50 length of unigene (bp)       1537                              1452
  GC% of unigene                   38.68                             38.46
  Length ranges (bp)               124–16,820                        114–17,062
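The mean lengths in Table 1 are the total number of bases divided by the number of sequences, which gives a quick internal consistency check on the table. The Python sketch below performs that check using the values reported in Table 1; the dictionary layout, the function names and the 0.05 bp tolerance (half of the last reported digit) are assumptions made for illustration.

```python
# (total bases, number of sequences, reported mean length in bp) from Table 1.
TABLE1 = {
    ("S. takanosis", "contigs"): (134_036_728, 170_449, 786.4),
    ("P. superans", "contigs"): (118_721_203, 159_074, 746.3),
    ("S. takanosis", "unigenes"): (100_232_710, 121_140, 827.4),
    ("P. superans", "unigenes"): (89_022_313, 107_950, 824.7),
}


def mean_length(total_bases: int, n_sequences: int) -> float:
    """Mean sequence length in bp."""
    return total_bases / n_sequences


def check_table(table=TABLE1, tolerance: float = 0.05) -> dict:
    """Return True for each row whose reported mean matches bases / count."""
    return {
        key: abs(mean_length(bases, count) - reported) <= tolerance
        for key, (bases, count, reported) in table.items()
    }


if __name__ == "__main__":
    for (species, level), ok in check_table().items():
        status = "consistent" if ok else "check"
        print(f"{species} {level}: {status}")
```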

2.2. Sequence Annotation

The assembled unigenes of P. superans and S. takanosis were used as query sequences in BLASTX searches (E-value ≤ 1.0 × 10⁻⁵) against various protein databases, including a locally curated Protostome DB (PANM-DB), Unigene and EuKaryotic Orthologous Groups (KOG) databases. Significant matches to subject sequences in PANM-DB were found for 44,529 (41.25%) and 46,852 (38.68%) unigenes of P. superans and S. takanosis, respectively. Similarly, a total of 15,331 (14.2%) unigenes of P. superans and 22,124 (18.26%) unigenes of S. takanosis showed homology to sequences in the Unigene database. In the KOG database, 18,511 (17.14%) and 24,603 (20.31%) unigenes of P. superans and S. takanosis, respectively, showed BLASTX hits. The sequence-based annotation of unigenes from the P. superans and S. takanosis transcriptomes is shown in Table 2.
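To make the E-value criterion concrete, the Python sketch below counts query sequences that have at least one hit at or below the cutoff. It assumes the hits are available in the 12-column tab-separated layout produced by BLAST+ with `-outfmt 6`, in which the E-value is the eleventh column; the article does not state which output format was used, and the records in `SAMPLE` are hypothetical toy data.

```python
import csv
import io

EVALUE_CUTOFF = 1.0e-5  # threshold quoted in the text


def annotated_queries(tabular_text: str, cutoff: float = EVALUE_CUTOFF) -> set:
    """Return the IDs of queries with at least one hit at or below the cutoff.

    Assumes 12 tab-separated columns per hit (query ID first, E-value 11th).
    """
    hits = set()
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if len(row) < 12 or row[0].startswith("#"):
            continue  # skip blank, truncated, or comment lines
        if float(row[10]) <= cutoff:
            hits.add(row[0])
    return hits


def annotation_rate(n_annotated: int, n_total: int) -> float:
    """Percentage of sequences that received an annotation."""
    return round(100.0 * n_annotated / n_total, 2)


def _row(query: str, subject: str, evalue: str) -> str:
    """Build one hypothetical 12-column hit line (only the E-value matters here)."""
    fields = [query, subject, "75.0", "120", "30", "0",
              "1", "360", "1", "120", evalue, "150"]
    return "\t".join(fields)


SAMPLE = "\n".join([
    _row("unigene_1", "protA", "1e-30"),
    _row("unigene_2", "protB", "0.002"),   # fails the cutoff
    _row("unigene_3", "protC", "1e-5"),
    _row("unigene_3", "protD", "3e-12"),   # second hit for the same query
])

if __name__ == "__main__":
    print(sorted(annotated_queries(SAMPLE)))
    print(annotation_rate(44_529, 107_950), "% of P. superans unigenes hit PANM-DB")
```

Counting unique query IDs rather than hit lines matters because a single unigene can match several database entries.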


Figure 2. Size distribution of contigs (A); and unigenes (B) after assembly and clustering of the quality reads from the transcriptomes of P. superans and S. takanosis.

Table 2. Sequence annotation of unigenes assembled from the Protantigius superans and Spindasis takanosis transcriptomes.

             All Annotated Transcripts    ≤300 bp                      300–1000 bp                  ≥1000 bp
Databases    P. superans   S. takanosis   P. superans   S. takanosis   P. superans   S. takanosis   P. superans   S. takanosis
PANM-DB      44,529        46,852         6272          7342           19,244        19,666         19,013        19,844
UNIGENE      15,331        22,124         1267          2751           5098          7848           8966          11,525
KOG          18,511        24,603         1273          2721           5399          7971           11,839        13,911
GO           18,661        22,275         1956          2705           6355          7566           10,350        12,004
KEGG         5259          6697           541           897            1615          2289           3103          3511
ALL          46,754        51,908         6739          8721           20,557        22,559         19,458        20,628

We found that longer unigene sequences (≥1000 bp) returned a higher proportion of BLAST hits against the various protein databases than short unigenes (≤300 bp). In the case of P. superans, a total of 18,506 and 13,234 unigenes had common homologous matches in the PANM-DB and KOG DB, and in the PANM-DB and Unigene DB, respectively. A total of 9659 unigene sequences overlapped among the three protein databases. Only three unigene sequences were found exclusively in the KOG DB, whereas two sequences were uniquely shared between the KOG DB and Unigene DB. A total of 22,448 and 2095 sequences showed homologies exclusive to the PANM-DB and Unigene DB, respectively. The sequence annotation results for P. superans are shown in Figure 3A. The unigene annotation for S. takanosis (Figure 3B) showed a total of 13,888 sequences overlapping among the three databases. A total of 24,528, 18,538, and 13,940 unigenes recovered BLAST hits from both the PANM-DB and KOG DB, the PANM-DB and Unigene DB, and the KOG DB and Unigene DB, respectively. BLASTX was also used to search for matches of P. superans and S. takanosis unigenes against the Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) protein functional databases. A total of 18,661 (17.29%) and 5259 (4.87%) unigenes of P. superans found matches to protein sequences in the GO and KEGG databases, respectively (Table 2). As expected, the majority (more than 55%) of GO and KEGG annotated transcripts were ≥1000 bp. The non-annotated transcripts may also represent novel genes, but it is possible that most of these shorter sequences lack a conserved functional domain and hence have no sequence matches in the databases.
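The overlap counts reported for P. superans can be cross-checked against the per-database totals in Table 2 by summing the regions of the three-set Venn diagram. The Python sketch below does this under one stated assumption: the PANM-DB/KOG and PANM-DB/Unigene figures (18,506 and 13,234) include the 9659 sequences shared by all three databases, whereas the two sequences "uniquely shared" between KOG and Unigene exclude them. This reading is an inference, not something the text states outright, but it reproduces the Table 2 totals exactly. All names are hypothetical.

```python
# Venn-diagram counts for P. superans as reported in the text.
PANM_ONLY = 22_448
KOG_ONLY = 3
UNIGENE_ONLY = 2_095
PANM_AND_KOG = 18_506        # assumed to include the triple overlap
PANM_AND_UNIGENE = 13_234    # assumed to include the triple overlap
KOG_AND_UNIGENE_ONLY = 2     # shared by KOG and Unigene but not PANM-DB
ALL_THREE = 9_659


def database_totals() -> dict:
    """Rebuild per-database totals and the three-database union."""
    panm = PANM_ONLY + PANM_AND_KOG + PANM_AND_UNIGENE - ALL_THREE
    kog = KOG_ONLY + PANM_AND_KOG + KOG_AND_UNIGENE_ONLY
    unigene = UNIGENE_ONLY + PANM_AND_UNIGENE + KOG_AND_UNIGENE_ONLY
    union = (
        PANM_ONLY + KOG_ONLY + UNIGENE_ONLY
        + (PANM_AND_KOG - ALL_THREE)
        + (PANM_AND_UNIGENE - ALL_THREE)
        + KOG_AND_UNIGENE_ONLY
        + ALL_THREE
    )
    return {"PANM-DB": panm, "KOG": kog, "Unigene": unigene, "union": union}


if __name__ == "__main__":
    for name, total in database_totals().items():
        print(f"{name}: {total:,}")
```

The union of the three databases obtained this way is smaller than the "ALL" row of Table 2, which also counts transcripts annotated through GO and KEGG.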

Figure 3. A summary depicting the annotation of P. superans (A); and S. takanosis (B) unigenes against PANM-DB, Unigene DB, and KOG DB.

2.3. Homology Characteristics of Assembled Unigenes

The characteristics of the homology search of assembled unigenes from P. superans with BLAST hits against the PANM-DB were analyzed. We examined homology characteristics such as the E-value, identity and similarity distributions (Figure 4). We also performed the homology search of assembled unigenes against the Unigene DB (Figure S1). The E-value distribution revealed that a significant proportion of the unigenes with PANM-DB hits (24,694, 55.46%) showed significant homology to previously deposited sequences in the PANM-DB, with an E-value ranging from 1.0 × 10⁻⁵⁰ to 1.0 × 10⁻⁵ (Figure 4A). The identity distribution chart revealed a close distribution, with 14,728 (33.08%), 14,144 (31.76%), and 10,740 (24.12%) unigenes showing identities of 60%–80%, 40%–60%, and 80%–100%, respectively. About 603 (1.35%) unigenes showed an identity of 100% to subject sequences in the PANM-DB (Figure 4B). According to the similarity distribution chart, 20,838 (46.80%) unigenes had a similarity of 80%–100% and 17,618 (39.57%) unigenes had a similarity of 60%–80% with the deposited sequences (Figure 4C). In addition, our results showed that the unigene hit percentage increased steadily with the length of the unigenes. More than 70% of P. superans unigene sequences over 1500 bp in length showed BLASTx hits to protein sequences in the PANM-DB. In contrast, only about 20% of sequences shorter than 300 bp found a hit to homologous sequences in the database (Figure 4D). In the homology search of P. superans assembled sequences against the Unigene DB (Figure S1), a majority showed an E-value ranging from 1.0 × 10⁻⁵⁰ to 1.0 × 10⁻⁵ (Figure S1A). The identity distribution plot revealed that most assembled sequences had an identity of 80%–100% (Figure S1B) to sequences in the Unigene DB. About 50% of sequences >2001 bp showed annotation hits, while none of the <200 bp sequences were annotated to sequences in the database (Figure S1C).
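The dependence of hit rate on unigene length can also be approximated from numbers already reported for P. superans: the PANM-DB hits per length class in Table 2 and the unigene length distribution in Section 2.1. The Python sketch below divides one by the other. It assumes that the length classes of the two sources line up (≤300, 301–1000 and >1000 bp), which the text does not state explicitly, so the resulting percentages are indicative only.

```python
# (PANM-DB hits from Table 2, total unigenes in the class from Section 2.1)
# for P. superans; the >1000 bp class merges the 1001-2000 and >2000 bp bins.
LENGTH_CLASSES = {
    "<=300 bp": (6_272, 29_596),
    "301-1000 bp": (19_244, 54_518),
    ">1000 bp": (19_013, 13_325 + 10_511),
}


def hit_rates(classes=LENGTH_CLASSES) -> dict:
    """Percentage of unigenes with a PANM-DB hit in each length class."""
    return {
        label: round(100.0 * hits / total, 1)
        for label, (hits, total) in classes.items()
    }


if __name__ == "__main__":
    for label, rate in hit_rates().items():
        print(f"{label}: {rate}% with a PANM-DB hit")
```

Under this assumption, roughly one in five of the shortest unigenes has a PANM-DB hit, compared with about four in five of those longer than 1000 bp, in line with the trend described for Figure 4D.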


Figure 4. Homology search characteristics of P. superans assembled unigenes against PANM database. (A) E-value distribution of BLAST hits for each unigene with a cutoff value of 1.0 × 10⁻⁵; (B) Identity distribution of BLAST hits for each unigene; (C) Similarity distribution of BLAST hits for each unigene; (D) Unigene lengths with or without hits.

We have summarized the characteristics of the homology search of assembled unigenes from S. takanosis in Figure 5. The E-value distribution chart revealed that about 22,791 (48.64%) unigenes showed an E-value ranging from 1.0 × 10⁻⁵⁰ to 1.0 × 10⁻⁵ (Figure 5A). As with P. superans, S. takanosis unigenes reveal close identity distribution with 15,658 (33.42%), 13,138 (28.04%), and 12,843 (27.41%) unigenes showing identity in the range of 60%–80%, 80%–100%, and 40%–60%, respectively. About 670 (1.43%) unigenes showed an identity of 100% to subject sequences in the PANM-DB (Figure 5B). The similarity distribution chart shows a larger share of unigenes (24,613, 52.53%) having similarity in the range of 80%–100% (Figure 5C). Unlike the P. superans unigenes hit percentage, almost 50% of the unigenes in S. takanosis with a length of less than 200 bp showed a positive hit. Following the same, the hit percentage increased by about 90% for unigenes above 2001 bp in length (Figure 5D). Homology search of S. takanosis assembled sequences using the Unigene DB (Figure S2) showed a majority of sequences having an E-value ranging from 1.0 × 10⁻⁵⁰ to 1.0 × 10⁻⁵ (Figure S2A). A majority of sequences showed an identity of 80%–100% (Figure S2B) to sequences in the Unigene DB. Assembled sequences of >2001 bp showed about 60% annotation hits to sequences in the Unigene DB (Figure S2C). Moreover, more unigene hits were observed with the PANM-DB compared to the Unigene DB over the length of assembled sequences. This indicates that longer unigenes are more likely to get an identifiable affiliation during BLAST matches, due to the likely presence of a representative protein domain that may be hard to find in shorter sequences. More so, the longest unigenes yield BLAST hits and annotations with a higher frequency [31,32].

The BLASTx top-hit species distribution of unigenes matched to the PANM-DB has been shown for P. superans and S. takanosis in Figure 6A,B, respectively. In the case of both the butterfly species, the highest match was observed to Danaus plexippus (14,239 unigenes, 13.19%, for P. superans; and 12,316 unigenes, 10.17%, for S. takanosis), followed by Bombyx mori (10,244 unigenes, 9.49%; and 8601 unigenes, 7.10%). Other species ranked high with a greater number of hits included the Arthropods such as Plutella xylostella, Pararge aegeria and the mollusk Aplysia californica among others with more than 1000 unigene hits. We also analyzed the top-hit species distribution of the 15,331 and

Int. J. Mol. Sci. 2015, 16, 29948–29970

22,124 assembled sequences for P. superans (Figure 6C) and S. takanosis (Figure 6D) matched to the Unigene database. The highest BLASTx hits were shown to Bombyx mori with 9402 and 9689 unigenes of P. superans and S. takanosis, respectively, showing matches.
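The PANM-DB top-hit percentages quoted above can be reproduced from the hit counts and the total unigene counts reported in this study (107,950 for P. superans and 121,140 for S. takanosis). The following is a minimal sketch of that arithmetic, assuming the percentages are expressed relative to all assembled unigenes of each species. The dictionary and function names are illustrative and are not part of the original analysis.

```python
# Total assembled unigenes per species, as reported in this study.
TOTAL_UNIGENES = {"P. superans": 107_950, "S. takanosis": 121_140}

# PANM-DB top-hit counts quoted in the text (species -> {top-hit organism: unigenes}).
TOP_HITS = {
    "P. superans": {"Danaus plexippus": 14_239, "Bombyx mori": 10_244},
    "S. takanosis": {"Danaus plexippus": 12_316, "Bombyx mori": 8_601},
}


def top_hit_share(species: str, organism: str) -> float:
    """Percentage of a species' unigenes whose best BLASTx hit is `organism`."""
    return round(100 * TOP_HITS[species][organism] / TOTAL_UNIGENES[species], 2)


for species, hits in TOP_HITS.items():
    for organism in hits:
        print(f"{species}: {organism} {top_hit_share(species, organism):.2f}%")
```

Under that assumption the computed shares agree with the reported 13.19%, 9.49%, 10.17%, and 7.10%.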

Figure 5. Homology search characteristics of S. takanosis assembled unigenes against PANM database. (A) E-value distribution of BLAST hits for each unigene with a cutoff value of 1.0 × 10⁻⁵; (B) Identity distribution of BLAST hits for each unigene; (C) Similarity distribution of BLAST hits for each unigene; (D) Unigene lengths with or without hits.

Figure 6. Top-hit species distribution. BLASTx top-hit species distribution of P. superans and S. takanosis against PANM-DB (A,B); and Unigene DB (C,D).

Functional prediction and classification of the butterfly unigenes were achieved by a search against the KOG database (Figure 7). A total of 18,511 (17.15% of total unigenes) and 24,603 (20.31% of total unigenes) unigenes of P. superans (Figure 7A) and S. takanosis (Figure 7B), respectively, were ascribed functions under 25 categories arranged to four main functional groups. A greater proportion of unigenes were clustered to the cellular processes and signaling group (4985 unigenes for P. superans and 6981 unigenes for S. takanosis), followed by the metabolism group (3408 for P. superans and 4882 for S. takanosis) and the information storage and processing group (2868 for P. superans and 3801 for S. takanosis). Within the cellular processes and signaling group, most of the unigenes were classified under the signal transduction process category followed by the post-translational modification, protein turnover and chaperone categories. Furthermore, a significant fraction of unigenes remained poorly characterized (5275 unigenes for P. superans and 6169 unigenes for S. takanosis).

Figure 7. Clusters of orthologous groups' (KOG) classification of P. superans (A); and S. takanosis (B) unigenes into four major categories of information storage and processing, cellular processes and signaling and metabolism. The code descriptions for KOG categories are as follows: J, translation, ribosomal structure, and biogenesis; A, RNA processing and modification; K, transcription; L, replication, recombination, and repair; B, chromatin structure and dynamics; D, cell cycle control, cell division, and chromosome partitioning; Y, nuclear structure; V, defense mechanisms; T, signal transduction mechanisms; M, cell wall/membrane/envelope biogenesis; N, cell motility; Z, cytoskeleton; W, extracellular structures; U, intracellular trafficking, secretion, and vesicular transport; O, post-translational modification, protein turnover, and chaperones; C, energy production and conversion; G, carbohydrate transport and metabolism; E, amino acid transport and metabolism; F, nucleotide transport and metabolism; H, co-enzyme transport and metabolism; I, lipid transport and metabolism; P, inorganic ion transport and metabolism; Q, secondary metabolites biosynthesis, transport and catabolism; R, general function prediction only; S, unknown function; Multi, more than one classified function.


2.4. Functional Annotation Using GO and KEGG

To functionally classify P. superans and S. takanosis unigenes, GO terms and KEGG pathway classification were assigned to each unigene using the BLAST2GO software suite. For P. superans, the 18,661 unigenes annotated from the total unigene profile were allocated one or more GO terms based on sequence similarity. A total of 89,289 unigene sequences were without GO terms. The GO-annotated unigenes were functionally classified into three broad categories: biological process, cellular component, and molecular function. A summary of P. superans unigenes and GO terms is shown in Figure 8. The molecular function category was assigned 16,200 unigenes, followed by biological process with 11,757, and cellular components with 6252 unigenes. Of these unigenes, 3880 showed functional attributes shared within the three main categories. Additionally, 5448, 808 and 737 unigenes were found uniquely attached to the molecular function, cellular component, and biological process categories, respectively (Figure 8A). As shown in Figure 8B, a significant number of unigenes were represented by more than a single GO term. Only 5492 (29.43%) unigenes were ascribed to one GO term with most unigenes (5760, 30.87%) represented by two GO terms.

Figure 8. The functional distribution of the assembled unigenes of Protantigius superans by gene ontology assignment. (A) The distribution of the GO-annotated unigenes to biological process, cellular components and molecular function categories; (B) The number of GO term annotations ascribed to each unigene.

All unigene sequences were classified into molecular function, cellular components, and biological process level 2. Within the molecular function category, the unigenes were further classified to 13 sub-categories, out of which a majority were represented under binding (9502 unigenes, 46.89%), catalytic activity (7611, 37.56%), and transporter activity (1263, 6.23%). About 61 sequences represented the antioxidant activity with very few transcripts assigned to protein tag, nutrient reservoir activity, and metallochaperone activity (Figure 9A). Within the cellular components category, the majority of unigenes represented cell (3530, 31.57%), membrane (3033, 27.12%), organelle (2458, 21.98%) and macromolecular complex (1510, 13.50%). Few unigenes also represented synapse (50, 0.45%) and extracellular matrix (49, 0.44%) (Figure 9B). For the biological process category, there were 19 sub-categories, and metabolic process (8303, 27.86%), cellular process (7985, 26.79%) and single-organism process (18.18%) were the predominant GO groups. A smaller proportion of unigenes also fell under response to stimulus (1447, 4.86%), signaling (1081, 3.63%), reproduction (41, 0.14%), reproductive process (37, 0.12%), and immune system process (29, 0.10%) (Figure 9C).

In the case of S. takanosis, 22,275 unigenes (18.39% of assembled unigenes) were assigned one or more GO terms, while 98,865 unigenes were without GO terms. The GO distribution for S. takanosis unigenes assigned 16,200 terms to the molecular functions category, followed by 11,757 terms to the biological process and 6252 terms to the cellular components categories (Figure 10A). A total of 6737, 978, and 757 unigenes were found exclusive to molecular function, cellular components, and biological process categories, respectively. About 4932 unigenes showed functional attributes shared between the three major categories of GO terms. As with P. superans, a greater number of S. takanosis unigenes (15,486) were represented by more than one GO term. Only 6789 (30.48%) unigenes showed homology to a single GO term (Figure 10B).

Figure 9. Gene ontology analysis of P. superans transcriptome using BLAST2GO. The number of unigenes assigned to the sub-categories under three major categories of molecular function (A); cellular components (B); and biological processes (C). All data are presented at level 2 GO categorization.

Figure 10. The functional distribution of the assembled unigenes of Spindasis takanosis by gene ontology assignment. (A) The distribution of the GO-annotated unigenes to biological processes, cellular components and molecular function categories; (B) The number of GO term annotations ascribed to each unigene.


The number of S. takanosis unigenes ascribed to various sub-categories under the three major categories of GO terms (level 2) is shown in Figure 11. Under the molecular function category, binding (11,887 unigenes, 48.08%) and catalytic activity (8845 unigenes, 35.78%) were majorly represented. A total of 83 unigene sequences were ascribed to antioxidant activity with only three transcripts linked to metallochaperone activity (Figure 11A). Regarding the cellular components category, a high proportion of unigenes were ascribed to cell (4373 unigenes, 31.67%), membrane (3726, 26.99%), organelle (2894, 20.96%), and macromolecular complex (2079, 15.06%) (Figure 11B). Out of the 35,050 unigene hits to the biological process category, 20 sub-categories were represented. A high proportion of sequences showed homology to metabolic process (9482 unigenes, 27.05%), cellular process (8773, 25.03%) and single-organismal process (6937 unigenes, 19.79%). A total of 1305, 39, 32, and 23 unigene sequences fell under signaling, reproduction, reproductive process, and immune system process, respectively (Figure 11C).

In discussing the dominant GO terms for Lycaenid butterflies P. superans and S. takanosis, we find that the profiles are very similar. The prominence of the GO biological process category over the GO molecular function and cellular components categories is understood in related species [33]. Also, the dominance of metabolic and cellular process under the GO biological process category has been consistently predicted in other Lepidopteran species. GO classification for functions in the sugarcane giant borer (Telchin licus licus) transcriptome shows over 50% of GO terms represented under metabolic and cellular processes [34]. Consistent with our observations, the most prominent GO molecular function categories include binding and catalytic activity and the most prominent GO cellular components are cell, organelle and macromolecular complex in Lepidoptera and other representative insects [33,35].

Figure 11. Gene ontology analysis of S. takanosis transcriptome using BLAST2GO. The number of unigenes assigned to the sub-categories under three major categories of molecular function (A); cellular components (B); and biological processes (C). All data are presented at level 2 GO categorization.


Furthermore, the unigenes were searched against the KEGG database for the identification of biological pathways active in the Lycaenid butterflies under investigation. In P. superans, a total of 5259 unigenes were assigned to 116 pathways. Among them, 709 enzymes were assigned to these pathways. The number of unigenes assigned to the main pathways is presented in Figure 12. The unigenes predominantly fall into the metabolism (5013 unigenes, 95.32%) group, followed by the organismal systems (118 unigenes, 2.24%), genetic information processing (68 unigenes, 1.29%), and environmental information processing (60 unigenes, 1.14%) groups. Among the metabolism group, the majority of the unigenes were ascribed to the nucleotide metabolism sub-group (1655, 31.47%) followed by the metabolism of co-factors and vitamins (1017, 19.34%). Apart from the metabolism group, the unigenes exclusively fell under the translation sub-group for the genetic information processing group, the signal transduction sub-group for the environmental information processing group and the immune system sub-group for the organismal systems group.
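As a quick consistency check on the P. superans KEGG breakdown, the four top-level groups should partition the 5259 assigned unigenes. The sketch below uses only the counts quoted above; the variable names are illustrative, and the percentages are assumed to be relative to all KEGG-assigned unigenes.

```python
# KEGG top-level group counts for P. superans unigenes (from the text).
kegg_counts = {
    "Metabolism": 5_013,
    "Organismal systems": 118,
    "Genetic information processing": 68,
    "Environmental information processing": 60,
}
total_assigned = 5_259  # unigenes assigned to KEGG pathways

# The four groups should account for every assigned unigene.
groups_sum = sum(kegg_counts.values())

# Share of each group among all KEGG-assigned unigenes, in percent.
shares = {name: round(100 * n / total_assigned, 2) for name, n in kegg_counts.items()}

print(groups_sum == total_assigned)
for name, pct in shares.items():
    print(f"{name:<40}{kegg_counts[name]:>6}{pct:>8.2f}%")
```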

Figure 12. KEGG pathway assignment for P. superans transcriptome.

The mapping of S. takanosis annotated unigenes to typical KEGG pathways identified a total of 6697 assembled sequences assigned to 119 pathways. Among them, 800 enzymes were assigned to these pathways. A total of 6438 (96.13%) unigenes belonged to the metabolism pathway, out of which 1887 unigenes fell under nucleotide metabolism and 1178 unigenes fell under the metabolism of cofactors and vitamins sub-group. A total of 128 unigenes (1.91%) were represented under the immune system pathway under the category of organismal systems. A total of 86 and 45 unigenes fell under the translation and signal transduction pathways of the genetic information processing and environmental information processing categories, respectively (Figure 13). The KEGG pathways identified in S. takanosis but not in P. superans were chlorocyclohexane and chlorobenzene degradation, flavone and flavonol biosynthesis, fluorobenzoate degradation and steroid degradation, while the betalain biosynthesis pathway was observed in only P. superans.

The GO classification, in a stricter sense, does not mean evidence of functionality; instead, it only suggests that a unigene sequence can be grouped to those of known (or predicted) function. An analysis to consider is the evidence code associated with each GO term. We find that a majority of GO terms (above 99% in both the sequenced Lycaenids) represented in our study are assigned the code “IEA” (inferred from electronic annotation), which are not manually curated and probably may contain more false positives. This is true, as out of the over 16 million GO annotations as of October 2007, 15,687,382 are in fact computationally derived IEA codes [36]. Hence, in the discussion of GO term annotations for the study, we emphasize that not all GO terms are of equal validity and, based on this, the interpretations of unigenes relate only to predicted function. Additionally, we presume at this point that, using BLASTx, unigene sequences are found to share homology with known pathway genes in the KEGG database. In addition, it is critical to study both the partial and full-length unigene sequences at the functional level for major applications of transcriptome sequencing.
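One way to act on this caution is to separate electronically inferred annotations from those backed by other evidence before interpreting them. The sketch below is an illustration only: the annotation records are hypothetical placeholders and not data from this study, and the helper name `split_by_evidence` is an assumption. The evidence codes used (IEA, IDA, ISS) are standard GO evidence codes.

```python
from collections import Counter

# Hypothetical (unigene ID, GO term, evidence code) records, for illustration only.
annotations = [
    ("unigene_0001", "GO:0005488", "IEA"),
    ("unigene_0001", "GO:0003824", "IEA"),
    ("unigene_0002", "GO:0008152", "IDA"),
    ("unigene_0003", "GO:0009987", "IEA"),
    ("unigene_0004", "GO:0005215", "ISS"),
]


def split_by_evidence(records, electronic_codes=("IEA",)):
    """Return (electronic, other) annotation lists based on the evidence code."""
    electronic = [r for r in records if r[2] in electronic_codes]
    other = [r for r in records if r[2] not in electronic_codes]
    return electronic, other


electronic, other = split_by_evidence(annotations)
code_counts = Counter(code for _, _, code in annotations)
iea_share = 100 * len(electronic) / len(annotations)

print(code_counts)
print(f"IEA share: {iea_share:.1f}%")
```

Applied to real BLAST2GO output, the same split would show how much of the functional profile rests on annotations that were never manually curated.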

Figure 13. KEGG pathway assignment for S. takanosis transcriptome.
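The pathway counts for the two species (116 for P. superans and 119 for S. takanosis) can be reconciled with simple set arithmetic. The sketch below assumes that the five pathways named in the text are the only differences between the two pathway lists, which the text implies but does not state outright. The shared pathways are represented by placeholder labels and not by real KEGG pathway names.

```python
# Pathways the text reports in only one of the two species.
only_takanosis = {
    "chlorocyclohexane and chlorobenzene degradation",
    "flavone and flavonol biosynthesis",
    "fluorobenzoate degradation",
    "steroid degradation",
}
only_superans = {"betalain biosynthesis"}

# Placeholder labels for pathways common to both species (assumed count:
# the 116 P. superans pathways minus its one exclusive pathway).
shared = {f"shared_pathway_{i}" for i in range(116 - len(only_superans))}

superans = shared | only_superans
takanosis = shared | only_takanosis

print(len(superans), len(takanosis))
print(sorted(takanosis - superans))
```

Under that assumption, 115 shared pathways plus the four S. takanosis-specific ones give the 119 pathways reported for that species.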

2.5. Protein Domain Analysis

InterProScan searches were conducted on the identified 107,950 P. superans unigenes by BLAST2GO. We discovered a total of 154,298 protein domains, of which the C2H2-like zinc finger domains were the most numerous (10,978). A summary of the top 40 domains predicted in the P. superans transcriptome is shown in Table S2. The notable conserved protein domains included the C2H2-like zinc finger, protein kinase, WD40 (also known as WD or β-transducin repeats), ABC-transporter, EGF-like (epidermal growth factor-like), Immunoglobulin-like, and fibronectin type-III domains. Among the top 40, we also found cytochrome P450, insect cuticle protein, G protein-coupled receptor, UDP-glucosyltransferase (Uridine diphosphate-glucosyltransferase), and major facilitator superfamily domains.

As with P. superans, a protein domain classification of S. takanosis transcripts was obtained using BLAST2GO, with the top 40 InterPro domains represented in Table S3. In addition to the notable presence of the C2H2-like zinc finger domain, WD40 repeat domain, Armadillo-like helical domain, and cytochrome P450 domain, as also observed with P. superans, some other functional domains were characteristic of S. takanosis unigenes. The C-type lectin domain, cadherin domain and the thioredoxin-like fold domain showed significant S. takanosis unigene hits. The C-type lectin domain proteins are known to be encoded by a number of genes in insects with strategic functions in the regulation of antimicrobial activity, proPO activation, and other associated immune functions [37,38].

The protein kinase and zinc finger domains show a conspicuous presence in other invertebrate transcriptomes, as these regulate cellular functions including survival, differentiation and apoptosis [31,39,40]. The abundant and conserved C2H2-like zinc finger domains exist in proteins as multiple tandem pairs of zinc fingers or tandem arrays of three or more zinc fingers, and hence are often represented by few proteins in surveyed species. These proteins are most likely DNA-binding transcription factors but can also bind to RNA and other protein targets [41,42]. The presence of WD40 repeat and Armadillo-like helical domains in 172 and 106 unigenes of

29962 Int. J. Mol. Sci. 2015, 16, 29948–29970

P. superans is consistent with similar analyses in insects and crustaceans [43,44]. The immunoglobulin and fibronectin type-III domains are basically involved in cell-signaling mechanisms related to cellular processes and immunity [31,45]. While cytochrome P450 genes are predominantly involved in the metabolism of xenobiotics in mollusks, polychaetes and crustaceans, these sequences are categorized as “environmental response genes” in insects [46]. The protein domain information provides vital clues to understanding the mechanisms of cellular survival and signaling leading to adaptation in the endangered Lycaenid butterflies, P. superans and S. takanosis.

2.6. Discovery of Microsatellites

We used the MISA (MicroSAtellite identification tool) Perl script to explore the SSR profiles in unigenes of the Lycaenid butterflies P. superans and S. takanosis. In the case of P. superans, out of a total of 107,950 unigenes investigated, 400,330 SSRs were detected. A total of 89,877 sequences contained SSRs, with 66,187 sequences (61.31% of all unigenes) containing more than one SSR. After eliminating the mono-nucleotide repeats (18,116 SSRs) and deca-nucleotide repeats (50 SSRs), a total of 382,164 SSRs were obtained. The di-nucleotide repeats were predominant (294,244, 76.99%), most of them with three tandem reiterations (254,481). In fact, three tandem repeats were the most common among all the repeat motifs. The di-nucleotide motifs were followed by tri- (19.5%), tetra- (2.68%), penta- (0.59%), hexa- (0.17%), hepta- (0.052%), and octa-nucleotide repeats (0.018%) (Table 3). AT/AT (138,978, 36.36%) was the most abundant motif in the SSR profile, followed by AC/GT (72,855, 19.06%), and AG/CT (53,967, 14.12%) (Figure 14A). The tri-nucleotide repeat motifs were consistent with this pattern, with minor variations in the numbers.

From the S. takanosis unigene information, we derived a total of 402,685 SSRs, with 67,023 sequences (55.33% of all unigenes) containing more than one SSR. A total of 141,700 SSRs were present in compound form. We obtained a total of 390,516 SSRs after elimination of mono-nucleotide repeats (12,142 SSRs) and deca-nucleotide repeats (27 SSRs). The di-nucleotide repeats formed the largest group with 310,901 SSRs (79.6%), followed by the tri- (67,604), tetra- (9098), penta- (1960), hexa- (738), hepta- (148), and octa-nucleotide motifs (93) (Table 4). The most abundant repeat motifs identified were AT/AT (135,984, 34.83%), AC/GT (83,729, 21.45%), and AG/CT (65,303, 16.73%) (Figure 14B).
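The filtering arithmetic reported above can be checked with a few lines of Python. This is only a sketch: the counts are copied from the text, while the dictionary layout and the helper names `retained_ssrs` and `share` are illustrative assumptions and not part of the MISA workflow.

```python
# Counts copied from the text; the data layout itself is an illustrative assumption.
SSR_COUNTS = {
    "P. superans": {"detected": 400_330, "mono": 18_116, "deca": 50, "di": 294_244},
    "S. takanosis": {"detected": 402_685, "mono": 12_142, "deca": 27, "di": 310_901},
}


def retained_ssrs(counts):
    """SSRs left after dropping mono- and deca-nucleotide repeats."""
    return counts["detected"] - counts["mono"] - counts["deca"]


def share(part, total):
    """Percentage of `part` in `total`, rounded to two decimals."""
    return round(100.0 * part / total, 2)


if __name__ == "__main__":
    for species, counts in SSR_COUNTS.items():
        kept = retained_ssrs(counts)
        # Prints the retained SSR total and the di-nucleotide share of that total.
        print(species, kept, share(counts["di"], kept))
```

Run as a script, this reproduces the retained totals (382,164 and 390,516) and the di-nucleotide shares (76.99% and 79.61%) quoted for the two species.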

Figure 14. Frequency distribution of simple sequence repeats based on motif sequence types. (A) P. superans; and (B) S. takanosis.

Table 3. Summary of simple sequence repeat (SSR) types in the Protantigius superans transcriptome.

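| Motif length | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | ≥21 | Number # | % $ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Di | 254,481 | 30,793 | 5254 | 1495 | 642 | 425 | 300 | 190 | 155 | 142 | 33 | 70 | 43 | 33 | 39 | 29 | 18 | 20 | 82 | 294,244 | 76.99 |
| Tri | 61,763 | 8858 | 2514 | 765 | 332 | 182 | 29 | 15 | 14 | 10 | 7 | 1 | 9 | 6 | 3 | 2 | 2 | 3 | 10 | 74,525 | 19.5 |
| Tetra | 8967 | 1050 | 196 | 32 | 1 | 1 | 4 | 3 | 3 | 0 | 0 | 2 | 0 | 0 | 0 | 2 | 1 | 0 | 3 | 10,265 | 2.68 |
| Penta | 2024 | 175 | 32 | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2242 | 0.59 |
| Hexa | 538 | 82 | 6 | 9 | 0 | 2 | 1 | 0 | 3 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 643 | 0.17 |
| Hepta | 156 | 14 | 3 | 3 | 1 | 3 | 1 | 2 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 186 | 0.052 |
| Octa | 55 | 1 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 59 | 0.018 |
| Total | 327,984 | 40,973 | 8007 | 2308 | 981 | 613 | 336 | 210 | 176 | 152 | 42 | 74 | 53 | 40 | 42 | 33 | 21 | 24 | 95 | 382,164 | 100.00 |

Columns 3 to ≥21 give the repeat numbers. # Number of SSRs detected in the unigenes; $ Relative percent of SSRs with different motif lengths among the total SSRs.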

Table 4. Summary of simple sequence repeat (SSR) types in the Spindasis takanosis transcriptome.

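| Motif length | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | ≥21 | Number # | % $ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Di | 272,076 | 31,299 | 4624 | 1078 | 476 | 291 | 211 | 149 | 142 | 140 | 33 | 46 | 62 | 57 | 38 | 30 | 31 | 17 | 101 | 310,901 | 79.61 |
| Tri | 59,652 | 5834 | 1249 | 416 | 215 | 111 | 16 | 21 | 16 | 11 | 4 | 5 | 4 | 4 | 8 | 4 | 3 | 5 | 26 | 67,604 | 17.31 |
| Tetra | 7734 | 859 | 295 | 134 | 10 | 9 | 6 | 11 | 8 | 3 | 8 | 6 | 7 | 2 | 2 | 2 | 1 | 0 | 1 | 9098 | 2.33 |
| Penta | 1687 | 185 | 49 | 6 | 11 | 6 | 5 | 3 | 1 | 4 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1960 | 0.50 |
| Hexa | 613 | 77 | 5 | 2 | 3 | 2 | 1 | 1 | 3 | 0 | 1 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 738 | 0.19 |
| Hepta | 113 | 17 | 2 | 5 | 3 | 1 | 3 | 1 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 148 | 0.038 |
| Octa | 73 | 4 | 10 | 2 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 93 | 0.024 |
| Total | 341,948 | 38,275 | 6234 | 1643 | 718 | 420 | 244 | 188 | 172 | 158 | 47 | 59 | 74 | 64 | 50 | 36 | 35 | 23 | 128 | 390,516 | 100.00 |

Columns 3 to ≥21 give the repeat numbers. # Number of SSRs detected in the unigenes; $ Relative percent of SSRs with different motif lengths among the total SSRs.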

SSRs derived from the Tenebrio molitor transcriptome database also show AT as the most abundant motif and SSRs with five repeat units as the most common genetic marker [47]. Another study reported an SSR profile of 92 in the butterfly Euphydryas editha [48], which accumulates the rarest SSRs of lepidopteran genomes. A thorough understanding of SSRs in the Lycaenid butterflies will be useful for the development of markers for genetic diversity assessment, gene flow characterization and conservation genomics. Additionally, SSRs from transcriptome datasets are critical for identifying associations with functional genes and cataloguing the phenotypes [49]. With the exception of mono-nucleotide repeat motifs, which may be an artifact of sequencing, the other repeat motifs should be suitable for the identification of polymorphic microsatellite loci [50,51]. A list of informative PCR primers targeting the most relevant repeat types (di-, tri-, and tetra-nucleotide with a minimum of seven repeats) is provided for P. superans (Table S4) and S. takanosis (Table S5). These sequences will be significant in further studies of genetic variation, population and conservation genomics of the species.
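As a rough illustration of how small the pool matching the stated primer target (di-, tri-, and tetra-nucleotide SSRs with at least seven repeats) is, the relevant bins of Table 3 can be summed. This is a sketch only: the rows are copied from Table 3, the function name `count_candidates` is hypothetical, and the result counts SSRs per bin rather than the primer pairs actually listed in Table S4.

```python
# Repeat-number columns of Table 3 (P. superans): 3, 4, ..., 20, then the ">=21" bin.
REPEAT_COLUMNS = list(range(3, 21)) + [21]  # 21 stands in for the ">=21" bin

# Rows copied from Table 3 (counts per repeat number, totals omitted).
TABLE3 = {
    "di": [254481, 30793, 5254, 1495, 642, 425, 300, 190, 155, 142,
           33, 70, 43, 33, 39, 29, 18, 20, 82],
    "tri": [61763, 8858, 2514, 765, 332, 182, 29, 15, 14, 10,
            7, 1, 9, 6, 3, 2, 2, 3, 10],
    "tetra": [8967, 1050, 196, 32, 1, 1, 4, 3, 3, 0,
              0, 2, 0, 0, 0, 2, 1, 0, 3],
}


def count_candidates(table, motifs=("di", "tri", "tetra"), min_repeats=7):
    """Sum the SSR counts in bins with at least `min_repeats` repeats."""
    totals = {}
    for motif in motifs:
        totals[motif] = sum(
            count
            for repeats, count in zip(REPEAT_COLUMNS, table[motif])
            if repeats >= min_repeats
        )
    return totals


if __name__ == "__main__":
    print(count_candidates(TABLE3))
```

Under these assumptions the pool is 2221 di-, 625 tri-, and 20 tetra-nucleotide SSRs, a small fraction of the 382,164 SSRs retained for P. superans.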

3. Experimental Section

3.1. Ethics Statement

For the collection of the endangered Lycaenid butterflies S. takanosis and P. superans, the necessary permission was obtained from the Hangang River Basin Environmental Office (Ref. No. 2014-26; 17 July 2014) and the Wongju Regional Environmental Office (Ref. No. 2014-22; 30 June 2014), Korea.

3.2. Sample Preparation and Illumina Sequencing

The geographic origins of P. superans and S. takanosis for sequencing were the Gangwon-do and Gyeonggi-do regions of Korea, respectively. A total of two individuals from each species were used for experimental purposes as per the notification of the Ministry of Environment, South Korea. Total RNA was extracted from the adults (pooled whole-body samples) of S. takanosis and P. superans using Trizol Reagent (Invitrogen, Carlsbad, CA, USA) according to the manufacturer’s instructions. The processed RNA was checked for purity and integrity using a Nanodrop-2000 spectrophotometer (Thermo Scientific, Wilmington, DE, USA) and the Bioanalyzer 2100 (Agilent Technologies, Santa Clara, CA, USA).

The mRNA-seq library was constructed using the mRNA-seq sample preparation kit (Illumina, San Diego, CA, USA). In the process, the total RNA was treated with DNase I, and magnetic beads with Oligo(dT) were used to purify poly(A+) mRNA from it. The purified mRNA was fragmented using the DNA fragmentation kit (Ambion, Austin, TX, USA) prior to cDNA synthesis. The short fragments of mRNA were used to transcribe first-strand cDNA using reverse transcriptase (Invitrogen) and random hexamer primers. The synthesis of second-strand cDNA was accomplished using DNA polymerase I (New England BioLabs, Ipswich, MA, USA) and RNase H (Invitrogen). Subsequently, the double-stranded cDNA was end-repaired using T4 DNA polymerase, the Klenow fragment, and the T4 polynucleotide kinase (New England BioLabs). The end-repaired cDNA fragments were ligated to the PE (Paired-end) Adapter Oligo Mix using T4 DNA ligase (New England BioLabs) at room temperature for 15 min. The suitable fragments (200 ± 25 bp), separated on a 2% agarose gel electrophoresis matrix, were paired-end sequenced on an ultra-high-throughput Illumina HiSeq 2500 sequencer. Illumina short-read sequencing is an appropriate NGS platform for the sequencing of transcriptomes in non-model species due to its affordability and output efficiency [31,52].

3.3. De Novo Assembly and Annotation

The raw paired-end reads of the two Lycaenid transcriptomes (S. takanosis and P. superans) were cleaned by filtering out adapter reads (nucleotide length of recognized adapter ≤13 and remaining adapter-excluded nucleotide length ≤35), repeated reads, and low-quality reads (phred quality score of less than 20) that may affect optimum assembly analysis and annotation. We used the command-line tool Cutadapt with default parameters (for paired-end reads: -a ADAPT1 -A ADAPT2 -o out1.fastq -p out2.fastq in1.fastq in2.fastq) [53] for pre-processing of raw reads. Cutadapt was selected over other popular adapter trimming programs [54] as it has one of the highest Matthews correlation coefficients (MCC), a quality indicator for pattern recognition.

The clean reads from the samples were assembled with the short-read assembly program Trinity (v2.0.6) [24] with 200 GB of memory and a path reinforcement distance of 50. The Trinity assembler with the default options (fastq type reads; paired read: RF; number of CPUs: eight; minimum assembled contig length of 200 bp) first assembled the reads to form longer fragments without gaps, called contigs. These contigs were further assembled to unigenes (having 94% identity, 30 bp overlap) using the sequence clustering software TIGR gene indices clustering tool (TGICL) [55]. After the elimination of redundant sequences, the longest transcripts were recognized as unigenes and were used for functional annotation analysis.

All the unigenes were searched against the PANM reference database (PANM-DB) [56] using the BLASTX program with an E-value threshold of 1.0 × 10⁻⁵ for the identification of functional transcripts. PANM-DB combines protein sequence data of Arthropoda, Nematoda, and Mollusca (in multi-FASTA format) downloaded from the browser of the NCBI nr database. PANM-DB is freely downloadable from the amino acid database BLAST web interface of the Malacological Society of Korea. Subsequently, the unigenes were searched against Unigene DB [57], Eukaryotic clusters of orthologous groups (KOG) DB [58], and Kyoto Encyclopedia of Genes and Genomes (KEGG) DB [59] using BLASTX at a typical cut-off E-value of less than 1.0 × 10⁻⁵. The numbers of unigenes that were either unique or shared among PANM-DB, Unigene DB and KOG DB were visualized using a three-way Venn diagram plot constructed using Venny [60].
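The E-value filtering and the three-way comparison of annotated unigenes can be sketched in Python as follows. This is not the authors' pipeline: it assumes BLAST tabular output (`-outfmt 6`, where column 1 is the query ID and column 11 the E-value), treats hits at or below the cut-off as significant, and uses hypothetical function names and invented toy identifiers.

```python
E_VALUE_CUTOFF = 1e-5  # threshold stated in the text


def significant_queries(tabular_lines, cutoff=E_VALUE_CUTOFF):
    """Return query IDs with at least one hit at or below the E-value cutoff.

    Assumes BLAST tabular output (-outfmt 6): column 1 is the query ID and
    column 11 is the E-value.
    """
    queries = set()
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        if float(fields[10]) <= cutoff:
            queries.add(fields[0])
    return queries


def venn3(a, b, c):
    """Split three sets into the seven regions of a three-way Venn diagram."""
    return {
        "a_only": a - b - c,
        "b_only": b - a - c,
        "c_only": c - a - b,
        "a_b": (a & b) - c,
        "a_c": (a & c) - b,
        "b_c": (b & c) - a,
        "a_b_c": a & b & c,
    }


if __name__ == "__main__":
    # Invented toy hits; real input would be one BLASTX result file per database.
    toy_hits = [
        "unigene1\tprot1\t90.0\t100\t5\t0\t1\t300\t1\t100\t1e-30\t200",
        "unigene2\tprot2\t40.0\t50\t30\t0\t1\t150\t1\t50\t0.5\t30",
    ]
    print(significant_queries(toy_hits))
```

Applying `significant_queries` to the PANM-DB, Unigene DB, and KOG DB results and passing the three sets to `venn3` gives the region sizes that a tool such as Venny draws.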
The gene ontology (GO) annotations presented represent the level 2 analysis, illustrating the predicted function of the assembled unigenes under the biological process, molecular function, and cellular component categories. The GO analysis was conducted using the professional BLAST2GO suite [61]. InterProScan within BLAST2GO was used to annotate the assembled unigenes with characteristic protein domains [62].
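A "top 40 domains" summary of the kind reported in Tables S2 and S3 can be derived from per-unigene domain annotations with a simple tally. The sketch below is an assumption-laden illustration: the input mapping, the domain labels, and the choice to count each domain once per unigene are hypothetical, and BLAST2GO may tally domains differently.

```python
from collections import Counter


def top_domains(annotations, n=40):
    """Rank domains by the number of distinct unigenes carrying them.

    `annotations` maps a unigene ID to the list of domain names assigned to it
    (a hypothetical layout, not the BLAST2GO export format).
    """
    counts = Counter()
    for domains in annotations.values():
        counts.update(set(domains))  # count each domain once per unigene
    return counts.most_common(n)


if __name__ == "__main__":
    # Invented toy annotations using placeholder domain labels.
    toy = {
        "unigene1": ["C2H2 zinc finger", "WD40 repeat"],
        "unigene2": ["C2H2 zinc finger"],
        "unigene3": ["C2H2 zinc finger", "C2H2 zinc finger"],
    }
    print(top_domains(toy, n=2))
```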

3.4. Identification of cSSR Markers

MicroSAtellite (MISA) [63] was used to identify microsatellites in the unigene sequences of the Lycaenid butterflies S. takanosis and P. superans. The simple sequence repeat (SSR) searches were run in default mode with detection of mono-, di-, tri-, tetra-, penta-, and hexa-nucleotide motifs, including compound SSRs (with more than one type of repeat unit). Primer pairs flanking the SSR motifs were designed using BatchPrimer3 [64] with the following criteria: primer lengths of 18–23 bases (optimum size of 21 bases), product size of 100–300 bases, Tm of 50–70 °C (optimum 55 °C), and primer GC content of 30%–70%.
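The idea behind SSR detection and primer screening can be sketched with a regular expression and two simple checks. This is not MISA or BatchPrimer3: the minimum repeat thresholds, the function names, and the toy sequence are illustrative assumptions, only perfect repeats are found, and the Tm and product-size criteria are not evaluated.

```python
import re

# Minimum repeat counts per motif length; illustrative values, not MISA defaults.
MIN_REPEATS = {2: 3, 3: 3, 4: 3}


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. not ATAT)."""
    return (motif + motif).find(motif, 1) == len(motif)


def find_ssrs(sequence, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect tandem repeats."""
    sequence = sequence.upper()
    hits = []
    for motif_len, min_rep in sorted(min_repeats.items()):
        # A motif followed by at least (min_rep - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_rep - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            if is_primitive(motif):
                hits.append((match.start(), motif, len(match.group(0)) // motif_len))
    return sorted(hits)


def primer_passes(primer, min_len=18, max_len=23, min_gc=30.0, max_gc=70.0):
    """Check the primer length and GC-content criteria quoted in the text."""
    primer = primer.upper()
    if not min_len <= len(primer) <= max_len:
        return False
    gc_percent = 100.0 * (primer.count("G") + primer.count("C")) / len(primer)
    return min_gc <= gc_percent <= max_gc


if __name__ == "__main__":
    toy_sequence = "GGC" + "AT" * 5 + "CCG" + "AGC" * 4 + "TT"  # invented example
    print(find_ssrs(toy_sequence))
    print(primer_passes("ATGCATGCATGCATGCATGCA"))
```

The primitive-motif check keeps a run such as (AT)6 from also being reported as an (ATAT)3 tetra-nucleotide repeat or a homopolymer run from being reported as a di-nucleotide repeat.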

4. Conclusions

In this study, the transcriptomes of endangered Lycaenid butterflies P. superans and S. takanosis were sequenced using Illumina HiSeq 2500. A de novo assembly and transcript annotation approach resulted in the identification of unigenes related to functional GO categories and KEGG pathways. Furthermore, the identification of SSRs with repeat types will assist in the development of large-scale molecular markers for the species. The valuable transcriptome sequence and functional information will provide necessary cues towards the successful implementation of sustainable conservation plans for the butterfly species in their preferred habitat. The sequence information to be indexed in the databases will be the basis for an evolutionary developmental study of Lycaenidae with symbiotic ants.

Supplementary Materials: Supplementary materials can be found at http://www.mdpi.com/1422-0067/16/12/26213/s1.

Acknowledgments: This work was supported by the grant entitled “The Genetic and Genomic Evaluation of Indigenous Biological Resources” funded by the National Institute of Biological Resources (NIBR201503202).

Author Contributions: Bharat Bhusan Patnaik, Hee-Ju Hwang, Soonok Kim and Yong Seok Lee designed the experiments. So Young Park, Tae Hun Wang, Eun Bi Park, Jong Min Chung, Dae Kwon Song and Jae Bong Lee performed the experiments. Bharat Bhusan Patnaik, Hee-Ju Hwang and Se Won Kang analyzed the data. Bharat Bhusan Patnaik, Hee-Ju Hwang and Se Won Kang wrote the paper. Heon Cheon Jeong, Changmu Kim, Soonok Kim, Hong Seog Park and Yeon Soo Han contributed reagents/materials/analysis tools. Yong Seok Lee supervised the entire study.

Conflicts of Interest: The authors declare no conflict of interest.

References

1. Fox, R.; Warren, M.S.; Brereton, T.M.; Roy, D.B.; Robinson, A. A new Red List of British butterflies. Insect Conserv. Divers. 2011, 4, 159–172.
2. Nakamura, Y. Conservation of butterflies in Japan: Status, actions and strategy. J. Insect Conserv. 2011, 15, 5–22.
3. IUCN. IUCN Red List of Threatened Species. Version 2010.4. Available online: http://www.iucnredlist.org (accessed on 14 July 2015).
4. Van Swaay, C.; Cuttelod, A.; Collins, S.; Maes, D.; Lopez Munguira, M.; Sasic, M.; Settele, J.; Verovnik, R.; Verstrael, T.; Warren, M.; et al. European Red List of Butterflies; Publications Office of the European Union: Luxembourg, 2010.
5. Choi, S.W.; Kim, S.S. The past and current status of endangered butterflies in Korea. Entomol. Sci. 2012, 15, 1–12.
6. National Institute of Biological Resources. Korean Red List of Threatened Species, 2nd ed.; National Institute of Biological Resources: Incheon, Korea, 2014.
7. Pierce, N.E.; Braby, M.F.; Heath, A.; Lohman, D.J.; Mathew, J.; Rand, D.B.; Travassos, M.A. The ecology and evolution of ant association in the Lycaenidae (Lepidoptera). Annu. Rev. Entomol. 2002, 47, 733–771.
8. Fiedler, K. The host genera of Ant-Parasitic Lycaenidae Butterflies: A Review. 2012, 10, 153975.
9. Thomas, J.A.; Simcox, D.J.; Clarke, R.T. Successful conservation of a threatened Maculinea butterfly. Science 2009, 325, 80–83.
10. Bonebrake, T.C.; Ponisio, L.C.; Boggs, C.L.; Ehrlich, P.R. More than just indicators: A review of tropical butterfly ecology and conservation. Biol. Conserv. 2010, 143, 1831–1841.
11. Ministry of Environment. Endangered Plants and Animals in Korea; Ministry of Environment: Seoul, Korea, 2005.
12. Jang, Y.J. Review on host ant of social parasitic Myrmecophiles in Korean Lycaenidae (Lepidoptera). J. Lepd. Soc. Korea 2007, 17, 29–38.
13. Kim, I.; Lee, E.M.; Seol, K.Y.; Yun, E.Y.; Lee, Y.B.; Hwang, J.S.; Jin, B.R. The mitochondrial genome of the Korean hairstreak, Coreana raphaelis (Lepidoptera: Lycaenidae). Insect Mol. Biol. 2006, 15, 217–225.
14. Kim, M.J.; Kang, A.R.; Jeong, H.C.; Kim, K.G.; Kim, I. Reconstructing intraordinal relationships in Lepidoptera using mitochondrial genome data with the description of two newly sequenced lycaenids, Spindasis takanosis and Protantigius superans (Lepidoptera: Lycaenidae). Mol. Phylogenet. Evol. 2011, 61, 436–445.
15. Allendorf, F.W.; Hohenlohe, P.A.; Luikart, G. Genomics and the future of conservation genetics. Nat. Rev. Genet. 2010, 11, 697–709.
16. Hoffman, J.I.; Simpson, F.; David, P.; Rijks, J.M.; Kuiken, T.; Thorne, M.A.; Lacy, R.C.; Dasmahapatra, K.K. High throughput sequencing reveals inbreeding depression in a natural population. Proc. Natl. Acad. Sci. USA 2014, 111, 3775–3780.
17. Nagaraj, S.H.; Gasser, R.B.; Ranganathan, S. A hitchhiker’s guide to expressed sequence tag (EST) analysis. Brief. Bioinform. 2007, 8, 6–21.
18. Vera, J.C.; Wheat, C.W.; Fescemeyer, H.W.; Frilander, M.J.; Crawford, D.L.; Hanski, I.; Marden, J.H. Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing. Mol. Ecol. 2008, 17, 1636–1647.

19. Smee, M.R.; Pauchet, Y.; Wilkinson, P.; Wee, B.; Singer, M.C.; French-Constant, R.H.; Hodgson, D.J.; Mikheyev, A.S. Microsatellites for the Marsh Fritillary Butterfly: De Novo transcriptome sequencing, and a comparison with amplified length polymorphism (AFLP) markers. PLoS ONE 2013, 8, e54721.
20. Gompert, Z.; Lucas, L.K.; Fordyce, J.A.; Forister, M.L.; Nice, C.C. Secondary contact between Lycaeides idas and L. melissa in the Rocky Mountains: Extensive admixture and a patchy hybrid zone. Mol. Ecol. 2010, 19, 3171–3192.
21. O’Bryhim, J.; Chong, J.P.; Lance, S.L.; Jones, K.L.; Roe, K.J. Development and characterization of sixteen microsatellite markers for the federally endangered species: Leptodea leptodon (Bivalvia: Unionidae) using paired-end Illumina shotgun sequencing. Conserv. Genet. Res. 2012, 4, 787–789.
22. Lance, S.L.; Love, C.N.; Nunziata, S.O.; O’Bryhim, J.R.; Scott, D.E.; Flynn, R.W.; Jones, K.L. 32 species validation of a new Illumina paired-end approach for the development of microsatellites. PLoS ONE 2013, 8, e81853.
23. Zhan, S.; Merlin, C.; Boore, J.L.; Reppert, S.M. The monarch butterfly genome yields insights into long-distance migration. Cell 2011, 147, 1171–1185.
24. Grabherr, M.G.; Haas, B.J.; Yassour, M.; Levin, J.Z.; Thompson, D.A.; Amit, I.; Adiconis, X.; Fan, L.; Raychowdhury, R.; Zeng, Q.; et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 2011, 29, 644–652.
25. Xie, Y.; Wu, G.; Tang, J.; Luo, R.; Patterson, J.; Liu, S.; Huang, W.; He, G.; Gu, S.; Li, S.; et al. SOAPdenovo-Trans: De novo transcriptome assembly with short RNA-Seq reads. Bioinformatics 2014, 30, 1660–1666.
26. Birol, I.; Jackman, S.D.; Nielsen, C.B.; Qian, J.Q.; Varhol, R.; Stazyk, G.; Morin, R.D.; Zhao, Y.; Hirst, M.; Schein, J.E.; et al. De novo transcriptome assembly with ABySS. Bioinformatics 2009, 25, 2872–2877.
27. Schulz, M.H.; Zerbino, D.R.; Vingron, M.; Birney, E. Oases: Robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics 2012, 28, 1086–1092.
28. Jimenez-Guri, E.; Huerta-Cepas, J.; Cozzuto, L.; Wotton, K.R.; Kang, H.; Himmelbauer, H.; Roma, G.; Gabaldon, T.; Jaeger, J. Comparative transcriptomics of early dipteran development. BMC Genom. 2013, 14.
29. Zhou, X.; Qian, K.; Tong, Y.; Zhu, J.J.; Qiu, X.; Zeng, X. De novo transcriptome of the hemimetabolous German cockroach (Blattella germanica). PLoS ONE 2014, 9, e106932.
30. Chen, H.; Lin, L.; Xie, M.; Zhang, G.; Su, W. De novo sequencing, assembly and characterization of antennal transcriptome of Anomala corpulenta Motschulsky (Coleoptera: Rutelidae). PLoS ONE 2014, 9, e114238.
31. Riesgo, A.; Andrade, S.C.S.; Sharma, P.P.; Novo, M.; Perez-Porro, A.R.; Vahtera, V.; Gonzalez, V.L.; Kawauchi, G.Y.; Giribet, G. Comparative description of ten transcriptomes of newly sequenced invertebrates and efficiency estimation of genomic sampling in non-model taxa. Front. Zool. 2012, 9.
32. Wang, X.J.; Xu, R.H.; Wang, R.I.; Liu, A.Z. Transcriptome analysis of Sacha Inchi (Plukenetia volubilis L.) seeds at two developmental stages. BMC Genom. 2012, 13.
33. Vogel, H.; Altincicek, B.; Glockner, G.; Vilcinskas, A. A comprehensive transcriptome and immune-gene repertoire of the lepidopteran model host Galleria mellonella. BMC Genom. 2011, 12.
34. De Assis Fonseca, F.C.; Firmino, A.A.P.; de Macedo, L.L.P.; Coelho, R.R.; de Sousa Junior, J.D.A.; Silva-Junior, O.B.; Togawa, R.C.; Pappas Junior, G.J.; Brandao de Gois, L.A.; Mattar da Silva, M.C.; et al. Sugarcane giant borer transcriptome analysis and identification of genes related to digestion. PLoS ONE 2015, 10, e0118231.
35. Nirmala, X.; Schetelig, M.F.; Yu, F.; Handler, A.M. An EST database of the Caribbean fruit fly, Anastrepha suspensa (Diptera: Tephritidae). Gene 2013, 517, 212–217.
36. Rhee, S.Y.; Wood, V.; Dolinski, K.; Draghici, S. Use and misuse of the gene ontology annotations. Nat. Rev. Genet. 2008, 9, 509–515.
37. Tanaka, H.; Ishibashi, J.; Fujita, K.; Nakajima, Y.; Sagisaka, A.; Tomimoto, K.; Suzuki, N.; Yoshiyama, M.; Kaneko, Y.; Iwasaki, T.; et al. A genome-wide analysis of genes and gene families involved in innate immunity of Bombyx mori. Insect Biochem. Mol. Biol. 2008, 38, 1087–1110.

38. Rao, X.J.; Cao, X.; He, Y.; Hu, Y.; Zhang, X.; Chen, Y.R.; Blissard, G.; Kanost, M.R.; Yu, X.Q.; Jiang, H. Structural features, evolutionary relationships, and transcriptional regulation of C-type lectin-domain proteins in Manduca sexta. Insect Biochem. Mol. Biol. 2015, 62, 75–85.
39. Zagrobelny, M.; Scheibye-Alsing, K.; Jensen, N.B.; Moller, B.L.; Gorodkin, J.; Bak, S. 454 pyrosequencing based transcriptome analysis of Zygaena filipendulae with focus on genes involved in biosynthesis of cyanogenic glucosides. BMC Genom. 2009, 10.
40. Bai, X.D.; Mamidala, P.; Rajarapu, S.P.; Jones, S.C.; Mittapalli, O. Transcriptomics of the bed bug (Cimex lectularius). PLoS ONE 2011, 6, e16336.
41. Brayer, K.J.; Segal, D.J. Keep your fingers off my DNA: Protein-protein interactions mediated by C2H2 zinc finger domains. Cell Biochem. Biophys. 2008, 50, 111–131.
42. Seetharam, A.; Bai, Y.; Stuart, G.W. A survey of well conserved families of C2H2 zinc-finger genes in Daphnia. BMC Genom. 2010, 11.
43. Altincicek, B.; Vilcinskas, A. Identification of immune-related genes from an apterygote insect, the firebrat Thermobia domestica. Insect Biochem. Mol. Biol. 2007, 37, 726–731.
44. Jung, H.; Lyons, R.E.; Dinh, H.; Hurwood, D.A.; McWilliam, S.; Mather, P.B. Transcriptomics of a giant freshwater prawn (Macrobrachium rosenbergii): De novo assembly, annotation and marker discovery. PLoS ONE 2011, 6, e27938.
45. Teichmann, S.A.; Chothia, C. Immunoglobulin superfamily proteins in Caenorhabditis elegans. J. Mol. Biol. 2000, 296, 1367–1383.
46. Meng, X.; Zhang, Y.; Bao, H.; Liu, Z. Sequence analysis of insecticide action and detoxification-related genes in the insect pest natural enemy Pardosa pseudoannulata. PLoS ONE 2015, 10, e0125242.
47. Zhu, J.Y.; Wu, G.X.; Yang, B. High-throughput discovery of SSR genetic markers in the yellow mealworm beetle, Tenebrio molitor (Coleoptera: Tenebrionidae), from its transcriptome database. Acta Entomol. Sin. 2013, 56, 724–728.
48. Mikheyev, A.S.; Vo, T.; Wee, B.; Singer, M.C.; Parmesan, C. Rapid microsatellite isolation from a butterfly by de novo transcriptome sequencing: Performance and a comparison with AFLP-derived distances. PLoS ONE 2010, 5, e11212.
49. Zalapa, J.E.; Cuevas, H.; Zhu, H.; Steffan, S.; Senalik, D.; Zeldin, E.; McCown, B.; Harbut, R.; Simon, P. Using next-generation sequencing approaches to isolate simple sequence repeat (SSR) loci in the plant sciences. Am. J. Bot. 2012, 99, 193–208.
50. Miller, A.D.; Good, R.T.; Coleman, R.A.; Lancaster, M.L.; Weeks, A.R. Microsatellite loci and the complete mitochondrial DNA sequence characterized through next generation sequencing and de novo assembly for the critically endangered orange-bellied parrot, Neophema chrysogaster. Mol. Biol. Rep. 2013, 40, 35–42.
51. Zhang, S.H.; Luo, H.; Du, H.; Wang, D.Q.; Wei, Q.W. Isolation and characterization of twenty-six microsatellite loci for the tetraploid fish Dabry’s sturgeon (Acipenser dabryanus). Conserv. Genet. Res. 2013, 5, 409–412.
52. Feldmeyer, B.; Wheat, C.W.; Krezdorn, N.; Rotter, B.; Pfenninger, M. Short read Illumina data for the de novo assembly of a non-model snail species transcriptome (Radix balthica, Basommatophora, Pulmonata), and a comparison of assembler performance. BMC Genom. 2011, 12.
53. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J. 2011, 17, 10–12.
54. Jiang, H.; Lei, R.; Ding, S.-W.; Zhu, S. Skewer: A fast and accurate adapter trimmer for next-generation sequencing paired-end reads. BMC Bioinform. 2014, 15, 182.
55. Pertea, G.; Huang, X.; Liang, F.; Antonescu, V.; Sultana, R.; Karamycheva, S.; Lee, Y.; White, J.; Cheung, F.; Parvizi, B.; et al. TIGR Gene Indices clustering tools (TGICL): A software system for fast clustering of large EST datasets. Bioinformatics 2003, 19, 651–652.
56. Kang, S.W.; Patnaik, B.B.; Hwang, H.J.; Park, S.Y.; Lee, J.S.; Han, Y.S.; Lee, Y.S. PANM DB (Protostome DB) for the annotation of NGS data of mollusks. Korean J. Malacol. 2015, 31, 243–247.
57. UniGene. Available online: ftp://ftp.ncbi.nih.gov/repository/UniGene/ (accessed on 17 July 2015).
58. Tatusov, R.L.; Fedorova, N.D.; Jackson, J.D.; Jacobs, A.R.; Kiryutin, B.; Koonin, E.V.; Krylov, D.M.; Mazumder, R.; Mekhedov, S.L.; Nikolskaya, A.N.; et al. The COG database: An updated version includes eukaryotes. BMC Bioinform. 2003, 4.

59. Kanehisa, M.; Goto, S.; Kawashima, S.; Okuno, Y.; Hattori, M. The KEGG resource for deciphering the genome. Nucleic Acids Res. 2004, 32, 277–280.
60. Oliveros, J.C. VENNY: An Interactive Tool for Comparing List with Venn Diagram. VENNY Website. Available online: http://bioinfogp.cnb.csic.es/tools/venny/index.html (accessed on 25 August 2015).
61. Conesa, A.; Gotz, S.; Garcia-Gomez, J.; Terol, J.; Talon, M.; Robles, M. BLAST2GO: A universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 2005, 21, 3674–3676.
62. Quevillon, E.; Silventoinen, V.; Pillai, S.; Harte, N.; Mulder, N.; Apweiler, R.; Lopez, R. InterProScan: Protein domains identifier. Nucleic Acids Res. 2005, 33, 116–120.
63. MISA-MicroSAtellite Identification Tool. Available online: http://pgrc.ipk-gatersleben.de/misa/ (accessed on 21 August 2015).
64. You, F.M.; Huo, N.; Gu, Y.Q.; Luo, M.C.; Ma, Y.; Hane, D.; Lazo, G.R.; Dvorak, J.; Anderson, O.D. BatchPrimer3: A high throughput web application for PCR and sequencing primer design. BMC Bioinform. 2008, 9.

© 2015 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons by Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).
