
Genome Survey Sequencing of Dioscorea zingiberensis

Wen Zhou+; Bin Li+; Lin Li; Wen Ma; Yuanchu Liu; Shuchao Feng; Zhezhi Wang*

1 Key Laboratory of the Ministry of Education for Medicinal Resources and Natural Pharmaceutical Chemistry, Shaanxi Normal University, Xi'an, Shaanxi 710119, P. R. China
2 National Engineering Laboratory for Resource Development of Endangered Chinese Crude Drugs in Northwest China, Shaanxi Normal University, Xi'an, Shaanxi 710119, P. R. China

+ These authors contributed equally to this work.

*Correspondence: Prof. ZheZhi WANG; [email protected]; Tel.: +862985310260

Keywords: Dioscorea zingiberensis, Genome survey sequencing, Genome analysis


Abstract

Dioscorea zingiberensis (Dioscoreaceae) is the main source of diosgenin (a steroidal sapogenin), the precursor for the production of steroid hormones in the pharmaceutical industry. Despite its large economic value, genomic information on the genus Dioscorea is currently unavailable. Here, we present an initial survey of the D. zingiberensis genome performed by next-generation sequencing technology, together with a genome size estimate inferred by flow cytometry. The whole-genome survey of D. zingiberensis generated 31.48 Gb of sequence data with approximately 78.70× coverage. The estimated genome size is 800 Mb, with a high level of heterozygosity based on K-mer analysis. The reads were assembled into 334,288 contigs with an N50 length of 1,079 bp, which were further assembled into 92,163 scaffolds with a total length of 173.46 Mb. A total of 4,935 genes, 81 tRNAs, 69 rRNAs, and 661 miRNAs were predicted by the genome analysis, and 263,484 repeated sequences were obtained, with 419,372 simple sequence repeats (SSRs). Among these SSRs, the mononucleotide repeat type was the most abundant (54.60% of the total SSRs), followed by the dinucleotide (29.60%), trinucleotide (11.37%), tetranucleotide (3.53%), pentanucleotide (0.65%), and hexanucleotide (0.25%) repeat types. The 1C-value of D. zingiberensis was calibrated against Salvia miltiorrhiza and calculated as 0.87 pg (851 Mb) by flow cytometry, which was very close to the result of the genome survey. This is the first report of genome-wide characterization within this taxon.


Introduction

D. zingiberensis is an important and widely used medicinal herb in Traditional Chinese Medicine (TCM). It has been applied for the treatment of various diseases, such as cough, anthrax, rheumatoid arthritis, and sprains, as well as cardiac diseases (Li et al. 2010a; Qin et al. 2009). Diosgenin, a steroidal sapogenin that is abundant in the rhizomes of D. zingiberensis, is an important steroidal precursor used in the pharmaceutical industry. It is widely used as the starting material for the synthesis of many steroidal drugs (e.g., antioxidants, anti-inflammatories, androgen, oestrogen, and contraceptives) because of the similarity in their skeletons (Bertrand et al. 2009; Wang et al. 2007). More importantly, steroidal sapogenins are attractive to many synthetic and medicinal chemists aiming to harness their anticancer activity (Minato et al. 2013). As global market demand for steroid hormones, such as sex hormones, cortical hormones, and protein anabolic hormones, increases at 8% annually, a matching supply of the precursor is required (Bai et al. 2015).

At present, the extraction of diosgenin from D. zingiberensis usually generates large volumes of high-acid, high-strength wastewater, which poses a serious threat to the environment. In view of this, microbial bioengineering offers an effective alternative method for producing diosgenin. However, genetic studies on D. zingiberensis remain underdeveloped compared with those on many other medicinal species, such as Salvia miltiorrhiza (Wenping et al. 2011), Dendrobium officinale (Liang et al. 2015), and Ganoderma lucidum (Chen et al. 2012), which might be due to the insufficient genetic or genomic resources available for D. zingiberensis.

In recent years, great advances in genome survey sequencing technology and bioinformatics have opened a new avenue for characterizing the genetic background of organisms, e.g., Myrica rubra (Jiao et al. 2012), Gracilariopsis lemaneiformis (Zhou et al. 2013), Fagopyrum tataricum (Hou et al. 2016), and others. Compared with conventional methods for gene cloning and sequencing, next-generation sequencing (NGS) technology affords a quick, easy, and full-scale method of investigation. To provide a genomic resource for further research on this species (e.g., structural and functional genomics studies, molecular cloning, and comparative and evolutionary studies), we conducted a genome survey of D. zingiberensis using NGS technology. This study could pave the way for accelerating gene discovery and for better use of the existing genomic information in the future.

Materials and methods

Plant materials

D. zingiberensis was collected from Xunyang County, Shaanxi Province, China. Voucher specimens were prepared and identified by Prof. Tian Xianhua (College of Life Sciences, Shaanxi Normal University, Xi'an, P. R. China) and then deposited at the Key Laboratory of the Ministry of Education for Medicinal Resources and Natural Pharmaceutical Chemistry, Shaanxi Normal University. Young leaves were collected, frozen in liquid nitrogen, and stored at –80°C prior to genomic DNA extraction using the Plant Genomic DNA Kit (Tiangen Biotech, Beijing, China) following the manufacturer's instructions. The extracts were electrophoresed on 1% agarose to confirm the DNA quality and quantity. The concentrations of nucleic acids and proteins were measured spectrophotometrically at 260 nm on a BioPhotometer (Eppendorf, Germany).

Genome size estimation by flow cytometry

Salvia miltiorrhiza (1C = 0.66 pg DNA; Zhang et al. 2015) served as an internal reference standard. One to two young leaves per plant, equivalent to 300–500 mg, were excised and placed into a 100 mm Petri dish. To this, 1.5 mL of LB01 buffer (Doležel et al. 1989) was added, and the two types of tissue were chopped simultaneously with a razor for 30 s (~60 chops per sample) to release the nuclei. The resulting homogenate was filtered through a 48 µm nylon filter into a 1.5 mL tube. Then, the nuclear suspension was stained with 10 µL of PI (10 mg/mL), and 10 µL of RNase A (10 mg/mL) was added immediately to prevent the staining of double-stranded RNA. The samples were incubated on ice for 10 minutes. Then, the aqueous suspension of intact nuclei from the samples and the internal reference DNA standard was analysed on a NovoCyte machine (ACEA Biosciences, Inc.) with NovoExpress software (Version …). A green argon laser at a wavelength of 488 nm was used as the light source, and at least 10,000 nuclei were measured per sample.
Genome sequencing and sequence assembly

Two paired-end libraries with an insert size of 220 base pairs (bp) were constructed from randomly fragmented genomic DNA following the manufacturer's instructions (Illumina, Beijing, China). Sequence data were generated by Beijing Biomarker Technologies Co., Ltd. (Beijing, China) using an Illumina HiSeq 2500 sequencing platform. The short tips and low-quality sequences in the raw genome survey sequence data were filtered out to obtain high-quality reads, which were subsequently used for assembly with SOAPdenovo software (Li et al. 2010b). All sequencing reads were deposited in the Sequence Read Archive (SRA) database and are retrievable under the accession number SRX3235157.

Genome size estimation by k-mer analysis

In shotgun genome sequencing, short reads are assumed to be randomly generated, so any k-mers in the reads also occur randomly. Their depth of coverage follows the Poisson distribution (Li and Waterman 2003), and the mean k-mer depth should be equal to the peak value of the k-mer depth distribution. Two paired-end libraries with insert sizes of approximately 220 bp and 500 bp were sequenced on one lane of the Illumina HiSeq 2500 system in paired-end 150 bp mode. The high-quality Illumina sequences generated from these two genomic libraries were used for k-mer counting with SOAPec (v2.01) in the SOAPdenovo software package. Then, based on the K-mer analysis, information on the peak depth and the number of 17-mers was obtained. The genome size and heterozygosity can thus be estimated using the following formula: Genome size = K-mer num / Peak depth (Varshney et al. 2012).
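The formula above can be illustrated with a minimal Python sketch. It is not the SOAPec procedure used in this study: the function names, the toy reads, and the simplified rule for picking the main peak (the most common depth) are assumptions made only for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def estimate_genome_size(kmer_counts):
    """Return (genome_size, kmer_num, peak_depth) using K-mer num / Peak depth."""
    kmer_num = sum(kmer_counts.values())
    # Depth histogram: depth -> number of distinct k-mers observed at that depth.
    histogram = Counter(kmer_counts.values())
    # Simplification (assumption): the most common depth is taken as the main
    # peak. Real data would first need low-depth error k-mers to be excluded.
    peak_depth = max(histogram, key=histogram.get)
    return kmer_num / peak_depth, kmer_num, peak_depth


# Hypothetical toy reads, not data from this study.
toy_reads = ["ACGTAC", "CGTACG"]
size, kmer_num, peak_depth = estimate_genome_size(count_kmers(toy_reads, k=3))
print(size, kmer_num, peak_depth)
```

In this toy case each of the four distinct 3-mers is seen twice, so eight k-mers at a peak depth of two give an estimate of four positions, which matches the number of distinct 3-mers in the underlying sequence.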
Guanine plus cytosine (GC) content analysis

GC content is an important attribute of the genomes of plants and other living organisms. It is strictly controlled and moderately balanced across the genome (Parker et al. 2008). K-mer sizes of 20, 37, 55, 63, 71, 77, 83, and 95 were examined using default parameters, and the optimal k-mer size was selected based on the N50 length. Usable reads > 200 bases in length were selected to realign the contig sequences because sequences < 200 bp were likely to be derived from repetitive or low-quality sequences (Lu et al. 2016). Finally, the GC content and average sequencing depth were calculated in 10-kb non-overlapping sliding windows along the assembled sequence.
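The windowed GC calculation can be sketched in Python as follows. This is an illustrative sketch only: the function name, the handling of N bases, and the toy sequence are assumptions, and the sequencing-depth part of the analysis (which requires read alignments) is not covered.

```python
def gc_content_windows(sequence, window=10_000):
    """Return (start, GC%) for each non-overlapping window of the sequence.

    GC% is computed over A/C/G/T bases only; a window made up entirely of
    other symbols (e.g. N gaps) is reported as None.
    """
    seq = sequence.upper()
    results = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        acgt = sum(chunk.count(base) for base in "ACGT")
        if acgt == 0:
            results.append((start, None))
            continue
        gc = chunk.count("G") + chunk.count("C")
        results.append((start, 100.0 * gc / acgt))
    return results


# Hypothetical toy sequence with a tiny window so the result is easy to check.
toy = "GGCCAATT" + "ATATATAT" + "NNNN"
print(gc_content_windows(toy, window=8))
```

The default window of 10,000 bases mirrors the 10-kb windows described above; the toy call uses a window of 8 only to keep the example readable.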

SSR identification

The Perl script MIcroSAtellite (MISA) was used to identify microsatellites in the D. zingiberensis genome (Thiel et al. 2003). We ran the MISA script with default parameters to identify SSRs in our sequence database. Through the analysis of genome sequences, six types of SSR can be identified: mono-, di-, tri-, tetra-, penta-, and hexanucleotide SSRs.
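To make the six SSR classes concrete, the following Python sketch finds perfect repeats with unit sizes of 1 to 6 using regular expressions. It is not MISA and does not reproduce its compound-SSR handling; the function name and the minimum repeat thresholds are assumptions, since the text only states that default parameters were used.

```python
import re

# Assumed minimum repeat counts per unit size (illustrative thresholds only).
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(sequence):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = sequence.upper()
    hits = []
    for unit, min_rep in MIN_REPEATS.items():
        # A motif of `unit` bases followed by at least (min_rep - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "ATAT");
            # those runs belong to the shorter unit class.
            if any(motif == motif[:d] * (unit // d)
                   for d in range(1, unit) if unit % d == 0):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // unit))
    return sorted(hits)


# Hypothetical toy sequence containing one mono-, one di-, and one tri-nucleotide SSR.
toy = "GGC" + "A" * 12 + "CG" + "AT" * 7 + "GG" + "CTT" * 5 + "G"
for start, motif, repeats in find_ssrs(toy):
    print(start, motif, repeats)
```

Counting the hits by motif length would give a breakdown by repeat type similar in form to the one reported for the survey data, although the numbers would depend on the thresholds chosen.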

Gene prediction and annotation

The raw survey data and transcriptome data (Hua et al. 2017) were used to predict and annotate genes. After filtering out scaffolds of < 1000 bp, Genscan, a program that identifies complete exon/intron structures of genes in genomic DNA, was applied for gene identification with parameters trained on D. zingiberensis (Burge and Karlin 1997). Additionally, TransDecoder v2.0 (Haas et al. 2013) and GeneMarkS-T v5.1 (Tang et al. 2015) were used to predict genes from the transcriptome database. Each predicted gene was annotated by BLAST alignment to the GenBank database and then compared against common databases such as plant Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), Eukaryotic Clusters of Orthologous Groups (KOG), Nr, TrEMBL, and Clusters of Orthologous Groups (COG). The described genes were then classified into the GO categories and mapped onto the KEGG reference pathways (Hirakawa et al. 2015).

Results

Genome size estimation by flow cytometry

The flow cytometric analyses yielded a high-resolution histogram showing the coefficients of variation (CVs) and mean values for the tetraploid D. zingiberensis and the internal standard Salvia miltiorrhiza (Fig. 1). The CVs were 0.80% and 2.01% for D. zingiberensis and Salvia miltiorrhiza, respectively. The peak ratio was calculated as 2.38, meaning the 1C-value of D. zingiberensis was 0.87 pg (1.32 × 0.66 pg = 0.87 pg).
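The arithmetic in the parenthetical calculation, and the conversion to the 851 Mb figure given in the abstract, can be checked with a short Python sketch. The factor of 978 Mb per pg is a commonly used conversion that is assumed here and is not stated in the text; the function and constant names are likewise illustrative.

```python
# Assumed conversion factor (not stated in the text): 1 pg DNA ~ 978 Mb.
PG_TO_MB = 978.0


def sample_c_value(factor, reference_pg):
    """1C-value of the sample (pg) = factor relative to the reference x reference 1C."""
    return factor * reference_pg


# Values from the text: factor 1.32 and Salvia miltiorrhiza 1C = 0.66 pg.
c_value_pg = round(sample_c_value(1.32, 0.66), 2)
genome_size_mb = c_value_pg * PG_TO_MB
print(f"1C-value: {c_value_pg:.2f} pg, about {genome_size_mb:.0f} Mb")
```

Under that assumed factor, 0.87 pg corresponds to roughly 851 Mb, in line with the value reported in the abstract.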

Genome sequencing and sequence assembly

A total of 31.48 Gb of sequence data were generated from the small-insert (220 bp) library, with 91.1% Q30 bases (base quality > 30), a quality level required for successful assembly, and approximately 78.70× coverage (Table 1). A large contig N50 and contig number might simply reflect a continuous and complete assembly (Li et al. 2012). The 31.48 Gb of clean reads were used to conduct de novo assembly. K-mer sizes of 20, 37, 55, 63, 71, 77, 83, and 95 were examined using default parameters. The assembly produced by SOAPdenovo with a k-mer size of 63 was selected because it gave the best N50 (Table 2). N50 is a weighted median: the length of the shortest contig in the set of longest contigs whose combined length totals at least 50% of the genome assembly (Carneiro et al. 2012; Earl et … ~40.16 kb (Table 2) was also generated. The total gap length (Ns) was ~2.91 Mb.
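The N50 definition used above (and the analogous N90 reported in Table 2) can be written as a short Python sketch. The function name and the toy contig lengths are illustrative assumptions and are not output from SOAPdenovo.

```python
def nx(lengths, percent=50):
    """Return the Nx length: the shortest sequence in the set of longest
    sequences whose combined length reaches `percent` % of the total."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length


# Hypothetical contig lengths (bp), not taken from the assembly.
toy_contigs = [2, 3, 4, 5, 6]
print("N50:", nx(toy_contigs, 50))
print("N90:", nx(toy_contigs, 90))
```

For the toy lengths the total is 20, so the two longest contigs (6 and 5) already cover half of it and N50 is 5, whereas four contigs are needed to reach 90% and N90 is 3.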

Genome size estimation

For the K-mer analysis, a total of 27.24 Gb of clean data (Table 3), obtained after filtering out the chloroplast sequencing data, were used to count and plot the distribution of 17-mer frequency and thereby estimate the genome size of D. zingiberensis. In the 17-mer frequency distribution (Figure 2), the average K-mer depth and the main peak were at ~57×. The repeat peak was at an integer multiple of the main peak depth (~114×). The heterozygous peak appeared at half the depth of the main peak (~28×), and a minor peak clearly appeared at a quarter of the depth of the main peak (~15×). The sample was therefore suspected to be autotetraploid. According to the genetic background of this species, both diploids and tetraploids occur in nature (Huang et al. 2010), so we deduced the sample to be autotetraploid. As a result, we estimated the genome size to be 800.00 Mb, calculated by using the following formula: Genome size = K-mer num / Peak depth. Repetitive sequences accounted for approximately 42.81% of the D. zingiberensis genome, an estimated 342.48 Mb. The heterozygosity rate was approximately 1.37%, which places D. zingiberensis among the complex genomes with high heterozygosity.

Guanine plus cytosine (GC) content analysis

To measure the genome-wide sequencing bias, the GC content and average sequencing depth were plotted using non-overlapping 10-kb sliding windows along the assembled sequence (Figure 3). The GC content of the genome varies among plant species. A GC content that is too high (>65%) or too low (<25%) may cause sequence bias on the Illumina sequencing platform, thus seriously affecting genome assembly (Aird et al. 2011). The average GC content of the D. zingiberensis genome was 39.12% (Table 2), which was higher than that of Arabidopsis thaliana (36%) (Barakat et al. 1998) and potato (34.8–36.0%) (Consortium et al. 2011; Hirakawa et al. 2015) but lower than that of some marine macroalgae, such as Cyanidioschyzon merolae (55.0%) (Ohta et al. 2003), Solieria filiformis (48.6%) (Dalmon and Loiseaux 1981), Chondrus crispus (46.3%) (Gall et al. 1993), and Laminaria hyperborea (42.6%) (Stam et al. 1988). Therefore, the D. zingiberensis genome has a mid-range GC content. Moreover, the GC-depth plot was slightly separated into four layers (Figure 3), which was caused in part by the polyploidy and the high heterozygosity rate of 1.37%.

SSR identification

A total of 3,548,310 sequences from the genome survey were examined, and these contained 419,372 SSRs (Table 4). Mononucleotide repeats were the predominant type, accounting for 54.60% of the observed SSRs, followed by the di- (29.60%), tri- (11.37%), tetra- (3.53%), penta- (0.65%), and hexanucleotide (0.25%) repeat types (Table 4). Mononucleotide repeats have been reported to be the most common type of repeat both in monocot species, such as rice, sorghum, and Brachypodium, and in dicot species, such as Arabidopsis, Medicago, and Populus, reaching up to 79% in Medicago (Sonah et al. 2011).

In addition, 354 motif types were identified, comprising mono- (4), di- (8), tri- (30), tetra- (86), penta- (130), and hexanucleotide (96) types (Table S1). Among the dinucleotide repeat motifs, the TA/TA and AT/AT repeats were the two most abundant types, accounting for 50.44% and 48.77%, respectively, followed by CT/AG repeats at 19.21% (Figure 4). The predominant trinucleotide motifs were ATT/AAT and TAA/TTA, accounting for 22.25% and 13.92%, respectively (Figure 5).
Gene prediction and annotation

Based on the combination of genome sequencing and transcriptome data analysis of D. zingiberensis, a total of 27,057 genes were predicted by Genscan and EVM (Altschul et al. 1990) (Table 5). The average length of the putative genes identified was 911 bp, and the average exon length and intron length were 544 bp and 236 bp, respectively. The putative genes were aligned by BLAST to the NR (Marchler-Bauer et al. 2011), KOG (Tatusov et al. 2001), GO (Dimmer et al. 2012), KEGG (Kanehisa and Goto 2000), and TrEMBL (Boeckmann et al. 2003) databases, and 86.78% of the putative genes were matched (Table 6).

GO annotations for the putative genes were obtained using the Blast2GO program (Conesa et al. 2005). Afterward, WEGO software (Ye et al. 2006) was applied to run GO functional classifications for all genes and to understand the distribution of gene functions in D. zingiberensis at the macro level. A total of 12,736 genes were identified by the GO slim analysis and further classified into the categories of molecular function, cellular component, and biological process (Figure 6). Specifically, 35.79%, 19.82%, and 44.39% of the genes were grouped under cellular components, molecular functions, and biological processes, respectively. Furthermore, cell and cell part (24.37% and 24.59%, respectively) were the most highly represented groups within cellular components; catalytic activity (47.18%) represented a relatively high proportion within molecular functions; and metabolic process (20.66%) was the most highly represented group within biological processes.

Altogether, 13,432 putative genes were classified into KOG functional categories. The largest group was the cluster for general function prediction only (3,306; 24.61%), followed by signal chaperones (1,269; 9.45%) (Figure S2).

Pathway assignments were made according to KEGG mapping (Kanehisa and Goto 2000). A total of 7,406 putative genes were assigned to 127 KEGG pathways (Table S3). Of these, 4,473 genes (60.40%) were associated with 96 metabolic pathways, among which 1,174 (26.25%) were involved in carbohydrate metabolism. The next largest category was genetic information processing (1,945; 26.26%), followed by environmental information processing (382 genes), cellular processes (350 genes), and organismal systems (251 genes).

Discussion

Genome size, also known as the genomic content or DNA 1C-value, refers to the DNA content of the gamete genome. Genome size is the basis for comparative and evolutionary genomics (Hirano and Das 2012). By comparing genome sizes across species, the patterns of genome variation can be detected and understood. Flow cytometry has been regarded as a standard method for predicting genome size, as in Lessingianthus (Vernonieae, Asteraceae) (Angulo and Dematteis 2013), Capsicum (Solanaceae) (Moscone et al. 2003), and Bemisia tabaci (Aleyrodidae) (Brown et al. 2005). In addition, Feulgen spectrophotometry (Ha et al. 2007) and pulsed-field gel electrophoresis (PFGE) (Lingohr et al. 2009; Zhang and Fan 2002) have been proven to be effective methods for determining genome size. However, the development of NGS technologies has provided researchers with a more efficient and affordable approach for addressing a wide range of questions relating to non-model species. Such a method has been applied to the analysis of the genomes of Brassica juncea (Yang et al. 2014), Myrica rubra (Jiao et al. 2012), and Gracilariopsis lemaneiformis (Zhou et al. 2013).

Among the 200–250 species in the genus Dioscorea, D. zingiberensis is the most important in terms of its high content of medicinally valuable dioscin. However, the limited genomic information available has constrained genetic studies on D. zingiberensis. This article offers a brief description of the genome size of D. zingiberensis, providing a reference for genome size variation in the genus Dioscorea. This is the first report of genome-wide characterization in the genus Dioscorea, and it evaluates the heterozygosity rate, GC content, and distribution of the genome. The main conclusions of the study are as follows.

(1) The genome size, heterozygosity rate, and GC content of D. zingiberensis are approximately 800 Mb, 1.37%, and 39.12%, respectively, as estimated from the K-mer depth distribution of the sequenced reads. Compared with other species in Dioscorea, the genome of D. zingiberensis is relatively small (Table 7). Genome size clearly varies among species in this genus. We speculate that this reflects differing extents of repeat sequence amplification in different species (Ohri and Khoshoo 1986; Wakamiya et al. 1993) and possible hybridization between closely related taxa (Hall et al. 2000). The most likely explanation is that frequent polyploidization events drove genome size variation during the evolution of the genus (Tian et al. 2008). Furthermore, the flow cytometry result is consistent with the genome survey.

Arabidopsis has been documented as a model organism for genetic study, mainly because it has a small genome (120 Mb) that is amenable to detailed molecular analysis (Meinke et al. 1998; Meyerowitz and Pruitt 1985). In marine macroalgae, Pyropia has been regarded as a model organism for genetic studies, partly because its haploid has a relatively small genome size (270–530 Mb) (Yang et al. 2011). The estimated genome size of D. zingiberensis (800 Mb) is smaller than that of several other Dioscorea species (Table 7), which shows the potential of D. zingiberensis as a model species in this regard.
(2) The genome survey identified a total of 419,372 SSRs in the D. zingiberensis genome. SSRs in plant genomes have been surveyed in many species, and the numbers differ considerably among sequenced plants, such as Oryza sativa (70,531), Arabidopsis thaliana (15,249), and Sorghum bicolor (73,658) (Sonah et al. 2011). The SSR number in D. zingiberensis was nearly six times greater than that in Oryza sativa. The frequency of each of the 354 SSR motif types is presented in Table S1. The TA/TA repeat was the most prominent type, accounting for 50.44%. The 419,372 SSR loci found in our study may be usable as SSR markers for genetic mapping in the short term.

(3) The number of genes predicted from the D. zingiberensis genome survey alone was much lower than that of other sequenced genomes such as Rosa roxburghii Tratt. (Lu et al. 2016), Prunus mume (Zhang et al. 2012), and Prunus persica (Verde et al. 2013). The likely reasons are insufficient sequencing depth and low sequence homology due to limited gene information from closely related species (Zhou et al. 2013). However, after combining the survey with the transcriptome data, the number of annotated genes increased significantly.

(4) Data generated by the whole-genome shotgun (WGS) strategy are relatively difficult to assemble (Chen and Pachter 2005). Instead, the more complicated bacterial artificial chromosome (BAC) strategy could resolve problems associated with the assembly of a heterozygous genome (Wu et al. 2013). However, to ensure the integrity of the genome assembly, homozygous materials have always been a priority for genome sequencing (Shulaev et al. 2011; Xu et al. 2013).



Author Contributions

Conceived and designed the experiments: WZ BL. Performed the experiments: YCL WM. Analysed the data: WZ LL. Contributed reagents/materials/analysis tools: SCF YCL. Wrote the paper: WZ BL.


Data Availability Statement: The genome sequence reads obtained by Illumina HiSeq 2500 are available at NCBI SRA. The BioProject accession number is PRJNA391240, and the BioSample accession number is SAMN07259770. The Experiment number is SRX3235157 (Dioscorea zingiberensis), and the Run number is SRR6122503.

Funding: This work was supported by the Fundamental Research Funds for the Central Universities (2017CSZ008) and the National Natural Science Foundation of China (31670299).

Figure 1. Analysis of D. zingiberensis genome size by flow cytometry.

Figure 2. Distribution of 17-mer frequency for estimating the genome size of D. zingiberensis.

Figure 3. GC content and average sequencing depth of the genome data used for assembly (the x-axis shows the GC content percentage across every 10-kb non-overlapping sliding window).

Figure 4. Percentage of different motifs in dinucleotide repeats in D. zingiberensis.

Figure 5. Percentage of different motifs in trinucleotide repeats in D. zingiberensis.

Figure 6. Gene Ontology classification. Genes were assigned to three categories: cellular components, molecular functions, and biological processes.


Supporting Information

1. Table S1. Occurrence of SSR motifs in the genome survey of D. zingiberensis. (XLS)

2. Figure S2. Gene assignment to KOG functional categories in D. zingiberensis. (TIF)

3. Table S3. Number of genes mapped onto KEGG pathways. (XLS)


Table 1. Statistics of sequencing data.

Library (bp)   Data (Gb)   Depth (X)   Q20 (%)   Q30 (%)
220            31.48       78.7        95.08     91.1




Table 2. Statistics of the assembled genome sequences.

Contigs
  Number of sequences      334,288
  Total length (bases)     174,634,115
  N50 length (bases)       1,079
  N90 length (bases)       219
  Max length (bases)       19,831
  GC content (%)           39.83
Scaffolds
  Number of sequences      3,548,310
  Total length (bases)     177,618,332
  N50 length (bases)       1,955
  N90 length (bases)       1,110
  Max length (bases)       40,163
  A                        281,503,836
  T                        271,128,258
  G                        174,939,581
  C                        180,143,770
  N                        10,628,111
  Total (ATGC)             907,715,445
  G+C% (ATGC)              39.12



Table 3. Statistics of sequencing data after filtering out the chloroplast data.

Library (bp)   Data (Gb)   Depth (X)   Q20 (%)   Q30 (%)
220            27.24       68.10       94.92     90.87




Table 4. Simple sequence repeat types detected in the D. zingiberensis sequences.

Searching item                                     Number         Ratio
Total number of sequences examined                 3,548,310
Total size of examined sequences (bp)              918,343,556
Total number of identified SSRs                    419,372        100%
Number of SSR-containing sequences                 353,988        84.41%
Number of sequences containing more than 1 SSR     50,592         12.06%
Number of SSRs present in compound formation       33,439         7.97%
Mononucleotide                                     228,973        54.60%
Dinucleotide                                       124,133        29.60%
Trinucleotide                                      47,681         11.37%
Tetranucleotide                                    14,815         3.53%
Pentanucleotide                                    2,726          0.65%
Hexanucleotide                                     1,044          0.25%


Table 5. Statistics on gene information.

Software                      EVM
Gene number                   27,057
Gene length (bp)              24,659,071
Average gene length (bp)      911.37
Exon length (bp)              14,722,744
Average exon length (bp)      544.14
Intron length (bp)            6,388,807
Average intron length (bp)    236.12




Table 6. Statistics of gene functional annotation.

Annotation database   Annotated number   Percentage (%)
GO                    12,736             47.07
KOG                   13,432             49.64
KEGG                  8,395              31.03
NR                    22,306             82.44
TrEMBL                22,024             81.40
All Annotated         2,915              86.78




Table 7. The genome size of the species in the genus Dioscorea.

Species                   Genome size (Mb)   Original reference
Dioscorea dumetorum       831.3              (Obidiegwu et al. 2009)
Dioscorea tokoro          430.3              (Veselý et al. 2012)
Dioscorea togoensis       469.4              (Hamon et al. 2011)
Dioscorea alata           567.2              (Arumuganathan and Earle 1991)
Dioscorea abyssinica      616.1              (Hamon et al. 2011)
Dioscorea mangenotiana    616.1              (Hamon et al. 2011)
Dioscorea praehensilis    616.1              (Hamon et al. 2011)
Dioscorea rotundata       694.3              (Obidiegwu et al. 2009)
Dioscorea cayenensis      753.1              (Obidiegwu et al. 2009)
Dioscorea sylvatica       831.3              (Bharathan et al. 1994)
Dioscorea esculenta       1026.9             (Obidiegwu et al. 2009)
Dioscorea bulbifera       1173.6             (Obidiegwu et al. 2009)
Dioscorea villosa         2347.2             (Bharathan et al. 1994)
Dioscorea elephantipes    6601.5             (Zonneveld et al. 2005)

