Entomological Research 46 (2016) 197–205

RESEARCH ARTICLE

Transcriptome profile of Chinese bush cricket, Gampsocleis gratiosa: A resource for microsatellite marker development

Zhi-Jun ZHOU1, Xiao-Yan KOU1, Lei-Yang QIAN1 and Jing LIU1,2

1 College of Life Sciences, Hebei University, Baoding, China
2 Chengdu Institute of Biology, Chinese Academy of Sciences, Chengdu, China

Correspondence
Zhi-Jun Zhou, College of Life Sciences, Hebei University, Baoding 071002, China.
Email: [email protected]

Received 28 October 2015; accepted 18 January 2016.

doi: 10.1111/1748-5967.12165

Abstract

The Chinese bush cricket, Gampsocleis gratiosa, has a long history as a pet in China. Because it is a non-model organism, no whole-genome sequence is available to date, and transcriptomic information is also scarce for this species. The G. gratiosa transcriptome was sequenced using Illumina HiSeq 2000 paired-end sequencing technology. In total, 52 million clean reads with an average length of 90 bp were generated, which produced 74,821 unigenes with a mean length of 580 bp and an N50 length of 759 bp. In total, 29,674 (39.66%) unigenes were successfully annotated against the NR, NT, Swiss-Prot, Gene Ontology (GO), Clusters of Orthologous Groups of proteins (COGs) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases. Further functional classification of unigenes against GO, COGs and KEGG found that a total of 11,935 (15.95%) unigenes were categorized into 61 GO terms, 19,576 unigenes were clustered into 25 COG functional categories and 17,971 unigenes were assigned to 258 KEGG pathways. In addition, 2093 microsatellite loci were identified, of which 591 loci had flanking sequences suitable for polymerase chain reaction (PCR) primer design. The transcriptome profile of G. gratiosa contributes to the accumulation of orthopteran genomic data, and the microsatellite loci provide useful tools for future studies of this and other closely related Gampsocleis species.

Key words: Gampsocleis gratiosa, transcriptome, next-generation sequencing, microsatellite DNA markers.

Introduction

The introduction of next-generation sequencing (NGS) technologies has led to significant declines in the time and cost to generate genomic tools for functional studies. Deep RNA sequencing (RNA-seq) produces at least 100 to 1000 times higher throughput than classical Sanger sequencing. NGS platforms such as Solexa/Illumina (Illumina), 454 (Roche) and SOLiD (ABI) have provided fascinating opportunities in the life sciences and have dramatically improved the efficiency and speed of gene discovery (Schuster 2008). More recent studies have reported that the Illumina sequencer can provide large amounts of longer reads (>75 bp), and the assembly programs that have been developed enable researchers to perform de novo transcriptome sequencing at a lower cost (Crawford et al. 2010; Yao et al. 2012; Li et al. 2013).

Microsatellite DNA markers are excellent genetic markers that are commonly used in genetic diversity, population structure and molecular ecological studies (Barker 2002; Gauthier et al. 2007; Chapuis et al. 2008; Park et al. 2012; Scholl et al. 2012). Traditional microsatellite DNA marker development requires partial genomic DNA library construction, cloning and labor-intensive Sanger sequencing. With the advent of NGS technology, it has become possible to develop large numbers of microsatellite markers for non-model organisms quickly and cost-efficiently (Peng et al. 2014; Wei et al. 2014; Yue et al. 2014; Huang et al. 2015). With the current technological limitations, available genomic data for orthopterans are very limited. Although the transcriptome only represents a small portion of the genome, it includes most protein coding genes and arguably represents the most functional part of the genome.

© 2016 The Entomological Society of Korea and John Wiley & Sons Australia, Ltd Z-J. Zhou et al

Currently, transcriptomes based on NGS mainly focus on new gene discovery, development of molecular markers and gene expression profiling, providing the opportunity to reveal gene functions related to insect life activities, phylogeny and evolution, and the interaction between insects and other organisms (Zhang & Yuan 2013). Until now, five orthopteran transcriptomes have been reported, including the desert locust Schistocerca gregaria central nervous system (Badisco et al. 2011), gregarious and solitary Locusta migratoria at various developmental stages (Chen et al. 2010), Gryllus firmus fat body and flight muscles (Nanoth Vellichirammal et al. 2014), Gryllus bimaculatus embryonic and ovarian tissues (Zeng et al. 2013) and Epacromius coerulipes (Jin et al. 2015). However, more transcriptome data are needed to establish patterns and formulate meaningful hypotheses about orthopteran evolution.

The Chinese bush cricket, Gampsocleis gratiosa, is widely distributed in most parts of China and in Mongolia, Korea and Russia (East Siberia). It is one of the most famous singing pets, and has been bred in China for over 2000 years. G. gratiosa is a tractable orthopteran model for functional genetic studies in the laboratory. So far, studies on G. gratiosa have included the mitochondrial genome (Zhou et al. 2008), cDNA full-length cloning and bioinformatic analysis of the piwi homolog giwi (Liu et al. 2013) and vasa (Kou et al. 2015), an embryonic cell line (Zhang et al. 2011), morphological and structural observation of the nuclei during spermiogenesis (Wang et al. 2014) and effects of mating status of both males and females on male copulation investment (Gao & Kang 2006). However, we still know relatively little about the molecular genetics of G. gratiosa. Because it is a non-model organism, neither its whole genome nor its transcriptome has been sequenced to date. Transcriptome sequencing is an efficient way to generate functional genomic-level data for non-model organisms. No microsatellite DNA markers are available in the public domain for G. gratiosa and closely related species, creating an incentive for further discovery and validation.

In this study, we generated the first de novo transcriptome data for G. gratiosa using the high-throughput Illumina sequencer. After de novo assembly, we carried out a functional annotation using bioinformatic analysis. In addition, we searched the whole transcriptome for microsatellite DNA loci.

Materials and methods

Insects and RNA extraction

In order to reveal as many genes as possible, while eliminating possible food residue and intestinal symbiotic organisms, total RNA was extracted from whole bodies of two G. gratiosa adult individuals (one male and one female) after removing the guts, using RNAiso Plus (TaKaRa) following the manufacturer’s protocol. Samples were collected from Shunping, Hebei, China, in 2012. The quality of total RNA was determined using an Agilent 2100 Bioanalyzer (Agilent Technologies Inc.) with an RNA6000 kit.

RNA-seq library construction and Illumina sequencing

Selection of mRNA, library preparation and sequencing were performed by BGI-Shenzhen, China, on an Illumina HiSeq 2000 sequencer according to the manufacturer’s specifications. Briefly, poly (A) mRNA was selected using oligo (dT) probes and fragmented into 200–700 bp pieces using divalent cations. Taking these short fragments as templates, random hexamer primers were used to synthesize the first-strand cDNA. The second-strand cDNA was synthesized using buffer, dNTPs, RNaseH and DNA polymerase I. Short fragments were purified with a QiaQuick PCR extraction kit and resolved with EB buffer for end repair and A-tailing. After that, the short fragments were ligated to sequencing adapters. Following agarose gel electrophoresis, suitable fragments were selected as templates for PCR amplification. Finally, the library was sequenced using an Illumina HiSeq 2000.

Data filtration, de novo assembly, gene functional annotation and GO/KEGG classification

Raw image data from the sequencing were transformed by base calling into sequence data, which were called raw reads. The clean reads, obtained after filtering out low-quality ("dirty") raw reads, were used for bioinformatic analysis. Transcriptome de novo assembly was carried out with the short-read assembly program Trinity (Grabherr et al. 2011). As the Trinity assembler discards low-coverage k-mers, no quality trimming of the reads was performed prior to the assembly. First, Trinity combined the reads with a certain overlap length to form longer fragments, which were called contigs. Second, the reads were mapped back to the contigs; with paired-end reads, Trinity was able to detect contigs from the same transcript and determine the distances between these contigs. Finally, Trinity connected these contigs into sequences that could not be extended on either end. In order to exclude interference from alternative splicing of transcripts, we first clustered all transcripts that matched the same reference gene. We then removed redundant transcripts and preserved only the longest transcript from each cluster to represent a unique gene. Such sequences were defined as unigenes. After clustering, the unigenes could be divided into two classes: clusters and singletons. Assembled sequences were annotated to the protein databases based on BLAST similarity using BLASTX (Altschul et al. 1990) with an e-value cut-off

<1.0e–5. These databases included the NCBI non-redundant protein (NR), NCBI non-redundant nucleotide sequence (NT), Swiss-Prot, Gene Ontology (GO) (Harris et al. 2008), Clusters of Orthologous Groups of proteins (COG) (Tatusov et al. 2000) and Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa & Goto 2000) databases. Protein function information can be predicted from annotation of the most similar protein in those databases. If the alignment results of different databases conflicted with each other, we followed the priority order of NR, NT, Swiss-Prot, GO, COG and KEGG when determining the unigene sequence direction. GO categories (Harris et al. 2008), COG (Tatusov et al. 2000) and KEGG pathways (Kanehisa & Goto 2000) were used to classify the functions and metabolic pathways of the transcripts. GO and KEGG classification was performed using Blast2GO (Conesa et al. 2005) pipelines with default parameters.

Microsatellite loci identification and primer design

We used the Perl script MIcroSAtellite identification tool (MISA, http://pgrc.ipk-gatersleben.de/misa) to identify microsatellite loci in all G. gratiosa transcriptome unigene sequences. The parameters were set as follows: ① perfect mono-, di-, tri-, tetra-, penta- and hexa-nucleotide motifs with a minimum of ten, six, five, five, five and five repeats, respectively; ② the maximum number of bases interrupting two microsatellite loci in a compound microsatellite was 100. Each microsatellite locus was considered as unique and was subsequently classified according to the theoretically possible combinations in each microsatellite DNA; for example, (AC)n is equivalent to (CA)n, (TG)n and (GT)n. The microsatellite numbers, motifs, repeat numbers, length of the repeat, repeat type, start and end positions of the repeat, and microsatellite sequences were analyzed.

Newly identified microsatellite loci are in general useful only if it is possible to design primers in the non-repeated flanking regions that can be successfully used for PCR amplification (Angeloni et al. 2011). We therefore designed primers for the sequences flanking these microsatellite loci using Primer 3_2.2.3 (Untergasser et al. 2012). Primers were designed to generate amplicons of 100 to 400 bp in length, with melting temperatures (Tm) ranging from 57 °C to 63 °C. Other parameters used the program default values.

Results

Illumina sequencing, de novo assembly and gene annotation

The reads produced by the Illumina HiSeq 2000 were used for clustering and de novo assembly. After eliminating adapter sequences and filtering out the low-quality reads, Illumina HiSeq 2000 sequencing yielded 52,182,518 high-quality transcriptomic reads with a total size of 4,696,426,620 bp. These reads were assembled into 197,744 contigs and 74,821 unigenes. The N50 and mean length of the unigenes were 759 bp and 580 bp, and those of the contigs were 318 bp and 252 bp, respectively (Table 1).

For annotation, we used the NR, NT, Swiss-Prot, GO, COG and KEGG gene function databases to annotate the assembled unigenes. There were 28,339 (37.88%) unigenes with significant matches in the NR database, 9765 matched the NT database, and 20,341 were similar to proteins in the Swiss-Prot database. In total 29,674 (39.66%) unigenes were successfully annotated in these databases (Table 2). The e-value distribution showed that 33.54% of the annotated sequences had strong homology (e-value < 1.0e–45), and the similarity distribution showed that 33.20% of the annotated sequences had a similarity greater than 60% (Fig. 1).

Table 1 Results of de novo assembly and annotation of G. gratiosa transcriptome

Search item                               Value
Total number of raw reads                 58,066,146
Total number of clean reads               52,182,518
Total clean nucleotides (nt)              4,696,426,620
Q20 (%)                                   96.86
Total number of contigs                   197,744
Total contig length (nt)                  49,856,308
Mean contig length (nt)                   252
N50 of contig (nt)                        318
Total number of unigenes                  74,821
Total unigene length (nt)                 43,373,070
Mean unigene length (nt)                  580
N50 of unigene (nt)                       759
Total consensus sequences of unigenes     74,821
Distinct clusters of unigenes             17,780
Distinct singletons                       57,041

Q20, percentage of bases in clean reads with a quality value greater than 20; N50, the maximum length X such that 50% of all nucleotides lie in contigs (or unigenes) of size at least X.

GO classification

GO is an international standardized gene functional classification system that offers a dynamically updated controlled vocabulary and a strictly defined concept to comprehensively describe properties of genes and their products in any organism (Harris et al. 2008). GO has three categories: molecular function, cellular component and biological process. The GO term is the basic unit, and every category consists of different numbers of GO terms. With the NR annotation, we used the Blast2GO program (Conesa et al. 2005) to obtain the GO annotation of the G. gratiosa


transcriptome unigene. After obtaining the GO annotation for every G. gratiosa transcriptome unigene, we used WEGO software (Ye et al. 2006) to complete the GO functional classification for all unigenes and to understand the distribution of gene functions of the species at the macro level. In total 11,935 (15.95%) unigenes were categorized into 61 functional groups in G. gratiosa. The three major categories (biological process, cellular component and molecular function) accounted for 39,153 (49.56%), 24,491 (31.00%) and 15,350 (19.43%) GO term assignments, respectively. The most common assignments in the three categories were cellular process (7182 unigenes, accounting for 60.18%), cell (5532 unigenes, accounting for 46.35%) and binding (6363 unigenes, accounting for 53.31%). An overall view of the distribution of the sequences in the three ontologies is given in Figure 2.

Table 2 Summary of functional annotation of assembled G. gratiosa transcriptome unigenes

Public database       NR       NT     Swiss-Prot   GO       COG    KEGG     Total
Number of unigenes    28,339   9765   20,341       11,935   8079   17,971   29,674

Figure 1 Characteristics of gene annotation of assembled G. gratiosa transcripts against the reference dataset. (A) e-value distribution of BLASTX hits for transcripts with a cut-off e-value of 1.0e–5. (B) Similarity distribution of BLASTX hits for transcripts.
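The e-value thresholds used for annotation (a cut-off of 1.0e–5 and "strong homology" below 1.0e–45) can be illustrated with a short Python sketch. It is not the pipeline used in the study: it assumes, purely for illustration, that the BLASTX hits are available as tab-separated lines in the common 12-column tabular layout, with the query ID in column 1 and the e-value in column 11. The function names and the example hits are hypothetical.

```python
from typing import Dict, Iterable

EVALUE_CUTOFF = 1.0e-5      # significance cut-off stated in the text
STRONG_HOMOLOGY = 1.0e-45   # "strong homology" threshold stated in the text


def best_evalues(blast_lines: Iterable[str]) -> Dict[str, float]:
    """Return the smallest significant e-value per query sequence.

    Assumption of this sketch: 12-column tab-separated BLAST output,
    query ID in column 1 and e-value in column 11.
    """
    best: Dict[str, float] = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, evalue = fields[0], float(fields[10])
        if evalue >= EVALUE_CUTOFF:
            continue  # hit does not pass the stated cut-off
        if query not in best or evalue < best[query]:
            best[query] = evalue
    return best


def strong_homology_share(best: Dict[str, float]) -> float:
    """Percentage of annotated queries whose best hit is below 1.0e-45."""
    if not best:
        return 0.0
    strong = sum(1 for e in best.values() if e < STRONG_HOMOLOGY)
    return 100.0 * strong / len(best)


# Hypothetical hits: u1 has two hits, u3 fails the cut-off.
example_lines = [
    "u1\tp1\t90.0\t100\t0\t0\t1\t300\t1\t100\t1e-50\t200",
    "u1\tp2\t80.0\t100\t0\t0\t1\t300\t1\t100\t1e-10\t80",
    "u2\tp3\t60.0\t50\t0\t0\t1\t150\t1\t50\t1e-8\t50",
    "u3\tp4\t40.0\t30\t0\t0\t1\t90\t1\t30\t0.5\t20",
]
example_best = best_evalues(example_lines)
example_share = strong_homology_share(example_best)
```

Keeping only the best hit per query mirrors the idea that each annotated unigene is counted once in the distribution; whether the original analysis binned best hits or all hits is not stated in the text.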

Figure 2 Distribution of Gene Ontology (GO) categories of transcripts for G. gratiosa.
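The percentages reported for the GO classification can be reproduced with simple arithmetic, sketched below in Python. The sketch assumes, based only on the numbers themselves, that the three category percentages are shares of all GO term assignments (the sum of the three category counts), whereas the percentages of the most common terms are shares of the 11,935 GO-annotated unigenes. The helper name `share` is hypothetical.

```python
def share(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


# GO term assignments per major category, as reported in the text.
go_assignments = {
    "biological process": 39153,
    "cellular component": 24491,
    "molecular function": 15350,
}
total_assignments = sum(go_assignments.values())
category_shares = {
    name: share(count, total_assignments)
    for name, count in go_assignments.items()
}

# Most common term in each category, as a share of GO-annotated unigenes.
GO_ANNOTATED_UNIGENES = 11935
top_terms = {"cellular process": 7182, "cell": 5532, "binding": 6363}
term_shares = {
    name: share(count, GO_ANNOTATED_UNIGENES)
    for name, count in top_terms.items()
}
```

Because a single unigene can carry several GO terms, the assignment counts sum to more than the number of annotated unigenes, which is why the two sets of percentages use different denominators.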


COG classification

COGs were delineated by comparing protein sequences encoded in complete genomes, representing major phylogenetic lineages (Tatusov et al. 2000). Each COG consists of individual proteins or groups of paralogs from at least three lineages, and thus corresponds to an ancient conserved domain. COG is a database in which orthologous gene products are classified. Every protein in the COGs is assumed to have evolved from an ancestral protein, and the whole database is built from the proteins encoded in complete genomes and the phylogenetic relationships of bacteria, algae and eukaryotes. Gampsocleis gratiosa transcriptome unigenes were aligned to the COG database to predict and classify possible functions. In total, 19,576 unigenes were clustered into 25 COG functional categories (Fig. 3). “General function prediction only” (3185, 16.27%) was the major COG category, followed by eight COG functional categories with more than 1000 unigenes: “replication, recombination and repair” (1840 unigenes, accounting for 9.40%), “translation, ribosomal structure and biogenesis” (1533 unigenes, accounting for 7.83%), “transcription” (1339 unigenes, accounting for 6.84%), “cell cycle control, cell division, chromosome partitioning” (1312 unigenes, accounting for 6.70%), “carbohydrate transport and metabolism” (1180 unigenes, accounting for 6.03%), “posttranslational modification, protein turnover, chaperones” (1176 unigenes, accounting for 6.01%), “function unknown” (1149 unigenes, accounting for 5.87%) and “cell wall/membrane/envelope biogenesis” (1037, 5.30%). The smallest categories (fewer than 100 unigenes) were “nuclear structure” (4 unigenes, accounting for 0.02%), “extracellular structures” (42 unigenes, accounting for 0.21%) and “RNA processing and modification” (79 unigenes, accounting for 0.40%).

Figure 3 Number of G. gratiosa unigenes in the 25 Clusters of Orthologous Groups (COG) functional classes.

KEGG classification

The KEGG pathway database records networks of molecular interactions in the cells, and variants of them specific to particular organisms (Kanehisa & Goto 2000). To understand the biological pathways involved in G. gratiosa, we mapped unigenes to terms in the KEGG database. In total 17,971 unigenes were assigned to 258 KEGG pathways. The major pathways containing more than 500 unigenes were metabolic pathways (ko01100) (2823 unigenes, accounting for 15.71%), pathways in cancer (ko05200) (728 unigenes, accounting for 4.05%), purine metabolism (ko00230) (720 unigenes, accounting for 4.01%), regulation of actin cytoskeleton (ko04810) (641 unigenes, accounting for 3.57%), focal adhesion (ko04510) (620 unigenes, accounting for 3.45%), bile secretion (ko04976) (606 unigenes, accounting for 3.37%), pyrimidine metabolism (ko00240) (594 unigenes, accounting for 3.31%), protein processing in endoplasmic reticulum (ko04141) (556 unigenes, accounting for 3.09%), RNA transport (ko03013) (553 unigenes, accounting for 3.08%), spliceosome (ko03040) (542 unigenes, accounting for 3.02%), Fc gamma


R-mediated phagocytosis (ko04666) (526 unigenes, accounting for 2.93%), Epstein-Barr virus infection (ko05169) (526 unigenes, accounting for 2.93%), Huntington’s disease (ko05016) (524 unigenes, accounting for 2.92%) and ABC transporters (ko02010) (506 unigenes, accounting for 2.82%) (Fig. 4).

Figure 4 Distribution of G. gratiosa transcripts among Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. The most highly represented pathways (>500 transcripts) are shown. Analysis was performed using Blast2GO and the KEGG database.

Microsatellite DNA locus identification and characterization

Using MISA software, 2093 potential microsatellite loci were identified in 1973 unigenes (or contigs). The numbers of mono-, di-, tri-, tetra-, penta- and hexa-nucleotide repeats were 562, 292, 1110, 122, 7 and 0, respectively. Microsatellite abundance decreased significantly as the motif repeat number increased. For the mono-, di-, tri-, tetra- and penta-nucleotide repeats, the commonest repeat numbers were 10 repeats (42.17%), 6 repeats (50.34%), 5 repeats (54.23%), 5 repeats (92.62%) and 5 repeats (28.57%), respectively (Table 3). In decreasing order, the top 10 most frequently occurring microsatellites were A/T, C/G, AC/GT, AG/CT, AT/AT, AAC/GTT, AAG/CTT, AAT/ATT, ACC/GGT and ACG/CGT. The longest mono-, di-, tri-, tetra- and penta-nucleotide microsatellites were (A/T)23, (AC/GT)55, (AAG/CTT)30, (AATG/ATTC)15 and (AAGTG/ACTTC)6, respectively (Table 4). Using Primer 3_2.2.3, three alternative primer pairs for 591 microsatellite loci were satisfactorily generated (Table S1). These microsatellite DNA markers will provide useful tools for population genetic and molecular ecological studies of G. gratiosa, and may also be useful in other closely related Gampsocleis species.

Table 3 Summary of microsatellite DNA loci search from G. gratiosa transcriptome

Search item                                                          Value
Total number of sequences examined                                   32,629
Total size of examined sequences (nt)                                27,278,433
Total number of identified microsatellite DNA loci                   2093
Number of sequences containing microsatellite DNA loci               1829
Number of sequences containing >1 microsatellite DNA locus           214
Number of microsatellite DNA markers present in compound formation   120
Mono-nucleotide                                                      562
Di-nucleotide                                                        292
Tri-nucleotide                                                       1110
Tetra-nucleotide                                                     122
Penta-nucleotide                                                     7
Hexa-nucleotide                                                      0

Discussion

The G. gratiosa transcriptome data consist of 74,821 unique transcript sequences, which is larger than S. gregaria (12,709 unique transcript sequences; Badisco et al. 2011) and G. firmus (34,411 unique transcript sequences; Nanoth Vellichirammal et al. 2014), similar to E. coerulipes (63,033 unique transcript sequences; Jin et al. 2015), but smaller than L. migratoria (91,907 unique transcript sequences; Liu et al. 2014) and G. bimaculatus (142,317 non-redundant assembly products, including 21,512 isotigs and 120,805 singletons; Zeng et al. 2013). LocustDB currently hosts 45,474 high-quality EST sequences from the locust, which were assembled into 12,161 unigenes (Ma et al. 2006). N50 length is commonly used for assembly evaluation, and a high number suggests a high-quality assembly (Lander et al. 2001). The N50 and mean length of G. gratiosa transcriptome unique transcript sequences were 759 bp and 580 bp, respectively. This is similar to S. gregaria with N50 750 bp (Badisco et al. 2011) and G. firmus with N50 513 bp (Nanoth Vellichirammal et al. 2014). It is smaller than L. migratoria with N50 2275 bp (Chen et al. 2010) and N50 1024 bp, mean 610 bp (Liu et al. 2014); G. bimaculatus with 21,512 isotigs, N50 = 2,133 bp (Zeng et al. 2013); and E. coerulipes with N50 1,589 bp, mean 772 bp (Jin et al. 2015). The annotations of unigenes provide a valuable resource for investigating specific processes, functions and pathways in orthopteran research. In terms of transcript annotation rate, 29,674 (39.66%) transcripts of G. gratiosa were annotated, whereas the rates were 4000 (31.47%) for S. gregaria (Badisco et al. 2011), 10,590 (14.5%) (Chen et al. 2010) and 23,359 unigenes (25.4%) for L. migratoria (Liu et al. 2014); 14,095 (41.5%) for G. firmus (Nanoth Vellichirammal et al. 2014); and 25,132 (39.87%) for E. coerulipes (Jin et al. 2015). There


were 19,874 (isotigs + singletons) unique BLAST hits against NR for 142,317 non-redundant assembly products (21,512 isotigs + 120,805 singletons) of the G. bimaculatus transcriptome (Zeng et al. 2013).

Table 4 Frequency of classified repeat types (considering sequence complementary) from G. gratiosa transcriptome†

Repeats  5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 27 30 55  Total
A/T  - - - - - 223 114 59 48 22 15 17 2 4 1 3 2 3 2  515
C/G  - - - - - 14 25 5 2 1  47
AC/GT  - 73 20 15 2 5 11 4 4  134
AG/CT  - 55 44 3 3 6 10 1  122
AT/AT  - 19 8 2 7  36
AAC/GTT  75 27 41 2  145
AAG/CTT  139 58 66 1 1  265
AAT/ATT  38 21 12 1  72
ACC/GGT  64 26 22 5 1  118
ACG/CGT  32 4 3 1 1  41
ACT/AGT  12 7 18 2  39
AGC/CTG  100 34 28 1  163
AGG/CCT  45 6 8 1 1  61
ATC/ATG  87 70 34 3  194
CCG/CGG  10 147 1 1  12
AAAC/GTTT  9  9
AAAG/CTTT  23  23
AAAT/ATTT  6 3 1  10
AACT/AGTT  2 1  3
AAGG/CCTT  5  5
AAGT/ACTT  13 1  14
AATC/ATTG  15 1  16
AATG/ATTC  4 1  5
ACAG/CTGT  5  5
ACAT/ATGT  19  19
ACGG/CCGT  1  1
ACTC/AGTG  3 1  4
ACTG/AGTC  1  1
AGAT/ATCT  7  7
AAAGC/CTTTG  1  1
AAGTG/ACTTC  5  5
ACTGC/AGTGC  1  1

†Blank cell indicates no data.

The S. gregaria study focused on transcript information from the central nervous system, the G. firmus study on fat body and flight muscle transcripts, and the G. bimaculatus study on embryonic and ovarian tissues. The novel G. gratiosa transcriptome pooled multiple tissue types from two adult individuals (one male and one female) to maximize the chance of revealing as many genes as possible. LocustDB contains EST data derived from primary cDNA libraries for head, hind leg, midgut and whole organisms of the 5th larval stage, and the E. coerulipes transcriptome was derived from whole organisms of the female 5th larval stage. Therefore, it is not surprising that there are distinct differences among these orthopteran transcript sequences, and only the GO, COG and KEGG classifications of E. coerulipes were then compared with our data.

GO classification based on sequence homology revealed that 11,935 of the assembled unigenes were categorized into 61 functional groups in G. gratiosa. A similar result was found in E. coerulipes, in which 11,558 (18.34%) unigenes were categorized into 58 functional groups (Jin et al. 2015). The differences between G. gratiosa and E. coerulipes were found in both the biological process and cellular component categories. For biological processes, four functional groups, “regulation of biological process” (2938), “negative regulation of biological process” (760), “positive regulation of biological process” (660) and “carbon utilization” (1), were not found in E. coerulipes. However, only “nucleoid” was not found in the G. gratiosa cellular component.

For COG classification, both G. gratiosa (8079, 10.80%) and E. coerulipes (8013, 12.71%) unigenes were clustered into


25 functional categories, of which “general function prediction only” was the major COG category, followed by “replication, recombination and repair” and “translation, ribosomal structure and biogenesis” (Jin et al. 2015).

For KEGG classification, 17,971 (24.02%) G. gratiosa unigenes were assigned to 258 KEGG pathways, whereas 7218 (11.46%) E. coerulipes unigenes were assigned to 218 known pathways (Jin et al. 2015). The number and proportion of unigenes assigned to KEGG pathways were clearly higher in G. gratiosa than in E. coerulipes.

Microsatellite markers have been documented as a high-potential molecular tool for genetic diversity studies in various organisms. In total 2093 potential microsatellite loci were identified in the G. gratiosa transcriptome, which was less than half of the number for E. coerulipes (Jin et al. 2015). There were 5696 potential microsatellite loci in the E. coerulipes transcriptome, in which the di-nucleotide repeat (39.80%) was the most abundant, followed by mono-nucleotide (35.25%), tri-nucleotide (23.35%), tetra-nucleotide (1.44%), penta-nucleotide (0.11%) and hexa-nucleotide (0.05%) repeats (Jin et al. 2015). However, the most common microsatellite repeat type in the G. gratiosa transcriptome was the tri-nucleotide repeat (53.03%).

To our knowledge, this study is the first report of transcriptome-wide identification and characterization of microsatellite markers of the Chinese bush cricket, G. gratiosa. These microsatellite markers will provide a good basis for investigating the genetic diversity of G. gratiosa and other closely related species, and will serve as a useful tool for DNA profiling. These data should help researchers investigate the evolution and biological processes of this species. However, further evaluation of their polymorphism is needed, and the utility of these microsatellites for primer cross-amplification between different G. gratiosa geographic locations and Gampsocleis species should be tested in the future.

Acknowledgements

This work was supported by National Natural Science Foundation of China (31471985) and Excellent Youth Scholars Program of Higher Education of Hebei Province (BJ2014006). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

Altschul S, Gish W, Miller W et al. (1990) Basic local alignment search tool. Journal of Molecular Biology 215: 403–410.
Angeloni F, Wagemaker CAM, Jetten MSM et al. (2011) De novo transcriptome characterization and development of genomic tools for Scabiosa columbaria L. using next-generation sequencing techniques. Molecular Ecology Resources 11: 662–674.
Badisco L, Huybrechts J, Simonet G et al. (2011) Transcriptome analysis of the desert locust central nervous system: production and annotation of a Schistocerca gregaria EST database. PLoS One 6(3): e17274.
Barker GC (2002) Microsatellite DNA: a tool for population genetic analysis. Transactions of the Royal Society of Tropical Medicine and Hygiene 96(Suppl 1): S21–24.
Chapuis MP, Lecoq M, Michalakis Y et al. (2008) Do outbreaks affect genetic population structure? A worldwide survey in Locusta migratoria, a pest plagued by microsatellite null alleles. Molecular Ecology 17: 3640–3653.
Chen SA, Yang PC, Jiang F et al. (2010) De novo analysis of transcriptome dynamics in the migratory locust during the development of phase traits. PLoS One 5(12): e15633.
Conesa A, Gotz S, Garcia-Gomez JM et al. (2005) Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 21: 3674–3676.
Crawford JE, Guelbeogo WM, Sanou A et al. (2010) De novo transcriptome sequencing in Anopheles funestus using Illumina RNA-seq technology. PLoS One 5(12): e14202.
Gao Y, Kang L (2006) Effects of mating status on copulation investment by male bushcricket Gampsocleis gratiosa (Tettigoniidae, Orthoptera). Science in China. Series C, Life Sciences 49: 349–353.
Gauthier N, Dalleau-Clouet C, Fargues J et al. (2007) Microsatellite variability in the entomopathogenic fungus Paecilomyces fumosoroseus: genetic diversity and population structure. Mycologia 99: 693–704.
Grabherr MG, Haas BJ, Yassour M et al. (2011) Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology 29: 644–652.
Harris MA, Deegan JI, Lomax J et al. (2008) The Gene Ontology project in 2008. Nucleic Acids Research 36: D440–D444.
Huang J, Li YZ, Du LM et al. (2015) Genome-wide survey and analysis of microsatellites in giant panda (Ailuropoda melanoleuca), with a focus on the applications of a novel microsatellite marker system. BMC Genomics 16: 61.
Jin YL, Cong B, Wang LY et al. (2015) An analysis of the transcriptome of Epacromius coerulipes (Orthoptera: Acrididae). Acta Entomologica Sinica 58: 817–825.
Kanehisa M, Goto S (2000) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research 28: 27–30.
Kou X, Liu J, Zhou Z et al. (2015) Cloning and bioinformatic analysis of VASA cDNA from Gampsocleis gratiosa. Journal of Environmental Entomology 37: 558–566.
Lander ES, Linton LM, Birren B et al. (2001) Initial sequencing and analysis of the human genome. Nature 409: 860–921.
Li P, Deng WQ, Li TH et al. (2013) Illumina-based de novo transcriptome sequencing and analysis of Amanita exitialis basidiocarps. Gene 532: 63–71.


Liu J, Zhou Z, Chang Y (2013) Full-long cDNA sequence cloning and bioinformatic analysis of Piwi subfamily member Giwi in the Gampsocleis gratiosa gonads. Biotechnology Bulletin 2: 111–117.
Liu SN, Wei W, Chu Y et al. (2014) De novo transcriptome analysis of wing development-related signaling pathways in Locusta migratoria Manilensis and Ostrinia furnacalis (Guenee). PLoS One 9(9): e106770.
Ma Z, Yu J, Kang L (2006) LocustDB: a relational database for the transcriptome and biology of the migratory locust (Locusta migratoria). BMC Genomics 7: 11.
Nanoth Vellichirammal N, Zera AJ, Schilder RJ et al. (2014) De novo transcriptome assembly from fat body and flight muscles transcripts to identify morph-specific gene expression profiles in Gryllus firmus. PLoS One 9(1): e82129.
Park M, Kim KS, Lee JH (2012) Isolation and characterization of eight microsatellite loci from Lycorma delicatula (White) (Hemiptera: Fulgoridae) for population genetic analysis in Korea. Molecular Biology Reports 39: 5637–5641.
Peng YL, Gao XF, Li RY et al. (2014) Transcriptome sequencing and de novo analysis of Youngia japonica using the Illumina platform. PLoS One 9(3): e90636.
Scholl K, Allen JM, Leendertz FH et al. (2012) Variable microsatellite loci for population genetic analysis of Old World monkey lice (Pedicinus sp.). Journal of Parasitology 98: 930–937.
Schuster SC (2008) Next-generation sequencing transforms today’s biology. Nature Methods 5: 16–18.
Tatusov RL, Galperin MY, Natale DA et al. (2000) The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Research 28: 33–36.
Untergasser A, Cutcutache I, Koressaar T et al. (2012) Primer3 – new capabilities and interfaces. Nucleic Acids Research 40(15): e115.
Wang X, Chang Y, Zhao Z et al. (2014) Morphological and structural observation of the nuclei during spermiogenesis in Gampsocleis gratiosa (Orthoptera, Tettigoniidae). Acta Entomologica Sinica 57: 1162–1170.
Wei L, Li SH, Liu SG et al. (2014) Transcriptome analysis of Houttuynia cordata Thunb. by Illumina paired-end RNA sequencing and SSR marker discovery. PLoS One 9(1): e84105.
Yao B, Zhao Y, Zhang H et al. (2012) Sequencing and de novo analysis of the Chinese Sika deer antler-tip transcriptome during the ossification stage using Illumina RNA-Seq technology. Biotechnology Letters 34: 813–822.
Ye J, Fang L, Zheng H et al. (2006) WEGO: a web tool for plotting GO annotations. Nucleic Acids Research 34(Web Server issue): W293–W297.
Yue XY, Liu GQ, Zong Y et al. (2014) Development of genic SSR markers from transcriptome sequencing of pear buds. Journal of Zhejiang University-Science B 15: 303–312.
Zeng V, Ewen-Campen B, Horch HW et al. (2013) Developmental gene discovery in a hemimetabolous insect: de novo assembly and annotation of a transcriptome for the cricket Gryllus bimaculatus. PLoS One 8(5): e61479.
Zhang Q, Yuan M (2013) Progress in insect transcriptomics based on the next-generation sequencing technique. Acta Entomologica Sinica 56: 1489–1508.
Zhang X, Feng Y, Ding WF et al. (2011) Establishment and characterization of an embryonic cell line from Gampsocleis gratiosa (Orthoptera: Tettigoniidae). In Vitro Cellular & Developmental Biology - Animal 47: 327–332.
Zhou Z, Shi F, Huang Y (2008) The complete mitogenome of the Chinese bush cricket, Gampsocleis gratiosa (Orthoptera: Tettigonioidea). Journal of Genetics and Genomics 35: 341–348.

Supporting information

Additional supporting information may be found in the supporting information tab for this article.
