Genome

Genome Survey Sequencing of Dioscorea zingiberensis

Journal: Genome

Manuscript ID gen-2018-0011.R2

Manuscript Type: Article

Date Submitted by the Author: 22-May-2018

Complete List of Authors:
Zhou, Wen; Shaanxi Normal University, Life Science School
Li, Bin; Shaanxi Normal University, Life Science School
Li, Lin; Shaanxi Normal University, Life Science School
Ma, Wen; Shaanxi Normal University, Life Science School
Liu, Yuanchu; Shaanxi Normal University, Life Science School
Feng, Shuchao; Shaanxi Normal University, Life Science School
Wang, Zhezhi; Shaanxi Normal University, Life Science School

Keyword: Dioscorea zingiberensis, Genome survey sequencing, Genome analysis

Is the invited manuscript for consideration in a Special Issue? : N/A

https://mc06.manuscriptcentral.com/genome-pubs Page 1 of 34 Genome

Genome survey sequencing of Dioscorea zingiberensis

Wen Zhou+; Bin Li+; Lin Li; Wen Ma; Yuanchu Liu; Shuchao Feng; Zhezhi Wang*

1 Key Laboratory of the Ministry of Education for Medicinal Resources and Natural Pharmaceutical Chemistry, Shaanxi Normal University, Xi'an, Shaanxi 710119, P. R. China

2 National Engineering Laboratory for Resource Development of Endangered Chinese Crude Drugs in Northwest China, Shaanxi Normal University, Xi'an, Shaanxi 710119, P. R. China

+ These authors contributed equally to this work.

*Correspondence: Prof. Zhezhi Wang; [email protected]; Tel.: +862985310260


Abstract

Dioscorea zingiberensis (Dioscoreaceae) is the main source of diosgenin (a steroidal sapogenin), the precursor for the production of steroid hormones in the pharmaceutical industry. Despite its large economic value, genomic information for the genus Dioscorea is currently unavailable. Here, we present an initial survey of the D. zingiberensis genome performed by next-generation sequencing together with a genome size estimate obtained by flow cytometry. The whole-genome survey of D. zingiberensis generated 31.48 Gb of sequence data with approximately 78.70× coverage. The estimated genome size is 800 Mb, with a high level of heterozygosity based on k-mer analysis. The reads were assembled into 334,288 contigs with an N50 length of 1,079 bp, which were further assembled into 92,163 scaffolds with a total length of 173.46 Mb. A total of 4,935 genes, 81 tRNAs, 69 rRNAs, and 661 miRNAs were predicted by the genome analysis, and 263,484 repeat sequences were obtained together with 419,372 simple sequence repeats (SSRs). Among these SSRs, the mononucleotide repeat type was the most abundant (54.60% of the total SSRs), followed by the dinucleotide (29.60%), trinucleotide (11.37%), tetranucleotide (3.53%), pentanucleotide (0.65%), and hexanucleotide (0.25%) repeat types. The 1C-value of D. zingiberensis was calibrated against Salvia miltiorrhiza and calculated as 0.87 pg (851 Mb) by flow cytometry, which was very close to the result of the genome survey. This is the first report of genome-wide characterization within this taxon.

Key Words: Dioscorea zingiberensis, Genome survey sequencing, Genome analysis


Introduction

D. zingiberensis is an important and widely used medicinal herb in Traditional Chinese Medicine (TCM). It has been used to treat various conditions, such as cough, anthrax, rheumatoid arthritis, sprains, and cardiac diseases (Li et al. 2010a; Qin et al. 2009). Diosgenin, a steroidal sapogenin that is abundant in the rhizomes of D. zingiberensis, is an important steroidal precursor used in the pharmaceutical industry. It is widely used as the starting material for the synthesis of many steroidal drugs (e.g., antioxidants, anti-inflammatories, androgen, oestrogen, and contraceptives) because of the similarity in their skeletons (Bertrand et al. 2009; Wang et al. 2007). More importantly, steroidal sapogenins are attractive to many synthetic and medicinal chemists aiming to harness their anticancer activity (Minato et al. 2013). As global market demand for steroid hormones, such as sex hormones, cortical hormones, and protein anabolic hormones, increases at 8% annually, a matching supply of the precursor is required (Bai et al. 2015).

At present, the extraction of diosgenin from D. zingiberensis usually generates large volumes of high-acid, high-strength wastewater, which poses a serious threat to the environment. In view of this, microbial bioengineering is an effective alternative method for producing diosgenin. However, genetic studies on D. zingiberensis remain underdeveloped compared with those on many other medicinal species, such as Salvia miltiorrhiza (Wenping et al. 2011), Dendrobium officinale (Liang et al. 2015), and Ganoderma lucidum (Chen et al. 2012), which might be due to the insufficient genetic and genomic resources available for D. zingiberensis.

In recent years, great advances in genome survey sequencing and bioinformatics have opened a new avenue for characterizing the genetic background of organisms, e.g., Myrica rubra (Jiao et al. 2012), Gracilariopsis lemaneiformis (Zhou et al. 2013), Fagopyrum tartaricum (Hou et al. 2016), and others. Compared with conventional methods for gene cloning and sequencing, next-generation sequencing (NGS) affords a quick, easy, and full-scale method of investigation. To provide a genomic resource for further research on this species (e.g., structural and functional genomics studies, molecular cloning, and comparative and evolutionary studies), we conducted a genome survey of D. zingiberensis using NGS technology. This study could pave the way for accelerating gene discovery and for better use of the existing genomic information in the future.

Materials and methods

Plant materials

D. zingiberensis was collected from Xunyang County, Shaanxi Province, China. Voucher specimens were prepared and identified by Prof. Tian Xianhua (College of Life Sciences, Shaanxi Normal University, Xi’an, P. R. China) and then deposited at the Key Laboratory of the Ministry of Education for Medicinal Resources and Natural Pharmaceutical Chemistry, Shaanxi Normal University. Young leaves were collected, frozen in liquid nitrogen, and stored at –80°C prior to genomic DNA extraction using the Plant Genomic DNA Kit (Tiangen Biotech, Beijing, China) following the manufacturer’s instructions. The extracts were electrophoresed on 1% agarose to confirm the DNA quality and quantity. The concentrations of nucleic acids and proteins were measured spectrophotometrically at 260 nm on a BioPhotometer (Eppendorf, Germany).

Genome size estimation by flow cytometry

Salvia miltiorrhiza (1C = 0.66 pg DNA; Zhang et al. 2015) served as an internal reference standard. One to two young leaves per plant, equivalent to 300-500 mg, were excised and placed into a 100 mm Petri dish. To this, 1.5 mL of LB01 buffer (Doležel et al. 1989) was added, and the two types of tissue were chopped simultaneously with a razor for 30 s (~60 chops per sample) to release the nuclei. The resulting homogenate was filtered through a 48 µm nylon filter into a 1.5 mL tube. Then, the nuclear suspension was stained with 10 µL of propidium iodide (PI; 10 mg/mL), and 10 µL of RNase A (10 mg/mL) was added immediately to prevent the staining of double-stranded RNA. The samples were incubated on ice for 10 minutes. Then, the aqueous suspension of intact nuclei from the samples and the internal reference DNA standard was analysed on a NovoCyte flow cytometer (ACEA Biosciences, Inc.) with NovoExpress software (Version 1.2.4.1602). An argon laser at a wavelength of 488 nm was used as the light source, and at least 10,000 nuclei were measured per sample.

Genome sequencing and sequence assembly

Two paired-end libraries with an insert size of 220 base pairs (bp) were constructed from randomly fragmented genomic DNA following the manufacturer’s instructions (Illumina, Beijing, China). Sequence data were generated by Beijing Biomarker Technologies Co., Ltd. (Beijing, China) using an Illumina HiSeq 2500 sequencing platform. The short tips and low-quality sequences of the raw genome survey sequence data were filtered out to obtain high-quality reads, which were subsequently used for assembly with SOAPdenovo software (Li et al. 2010b). All sequencing reads were deposited in the Sequence Read Archive (SRA) database (http://www.ncbi.nlm.nih.gov/sra/) and are retrievable under the accession number SRX3235157.

Genome size estimation by k-mer analysis

In shotgun genome sequencing, short reads are assumed to be generated randomly, so any k-mers in the reads also occur randomly. Their depth of coverage follows a Poisson distribution (Li and Waterman 2003), and the mean k-mer depth should equal the peak value of the k-mer depth distribution. Two paired-end libraries with insert sizes of approximately 220 bp and 500 bp were sequenced on one lane of the Illumina HiSeq 2500 system in paired-end 150 bp mode. The high-quality Illumina sequences generated from these two genomic libraries were subjected to k-mer counting using SOAPec (v2.01) in the SOAPdenovo software package. Based on the k-mer analysis, the peak depth and the number of 17-mers were obtained. The genome size can then be estimated with the following formula, and the heterozygosity can be assessed from the same depth distribution (Varshney et al. 2012):

Genome size = K-mer number / Peak depth

Guanine plus cytosine (GC) content analysis

The level of GC content is an important attribute of the genomes of plants and other living organisms. The GC content is strictly controlled and moderately balanced across the genome (Parker et al. 2008). K-mer sizes of 20, 37, 55, 63, 71, 77, 83, and 95 were examined using default parameters, and the optimal k-mer size was selected based on the N50 length. Usable reads > 200 bases in length were selected for realignment to the contig sequences because sequences < 200 bp were likely to be derived from repetitive or low-quality sequences (Lu et al. 2016). Finally, the GC content and average sequencing depth were calculated in 10-kb non-overlapping sliding windows along the assembled sequence.

SSR identification

The Perl script MIcroSAtellite (MISA) was used to identify microsatellites in the D. zingiberensis genome (Thiel et al. 2003). MISA (http://pgrc.ipk-gatersleben.de/misa/misa.html) was run with default parameters to identify SSRs in our sequence database. Six types of SSR were considered in the analysis of the genome sequences: mono-, di-, tri-, tetra-, penta-, and hexanucleotide SSRs.
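The six SSR classes can be illustrated with a small regular-expression scan. The sketch below is not MISA itself, and the names `MIN_REPEATS`, `is_primitive`, and `find_ssrs` are hypothetical. It assumes minimum repeat counts of 10, 6, 5, 5, 5, and 5 for mono- to hexanucleotide motifs, which are commonly quoted MISA defaults; the thresholds actually used in this study are not stated in the text.

```python
import re

# Assumed thresholds (unit size -> minimum number of repeats); not taken from the text.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif):
    """Return True if the motif is not itself a repeat of a shorter unit (e.g. 'AA')."""
    n = len(motif)
    return not any(
        n % d == 0 and motif == motif[:d] * (n // d) for d in range(1, n)
    )


def find_ssrs(sequence, min_repeats=MIN_REPEATS):
    """Return a sorted list of (start, motif, repeat_count) for perfect SSRs."""
    sequence = sequence.upper()
    hits = []
    for unit, min_rep in sorted(min_repeats.items()):
        # A motif of `unit` bases followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit, min_rep - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            if is_primitive(motif):
                hits.append((match.start(), motif, len(match.group(0)) // unit))
    return sorted(hits)


example = "CCG" + "A" * 12 + "GGC" + "AT" * 7 + "C" + "AAG" * 5 + "T"
for start, motif, repeats in find_ssrs(example):
    print(start, motif, repeats)
```

Motifs such as `AA` are skipped for the dinucleotide class so that a mononucleotide run is not counted twice; compound and imperfect SSRs are outside the scope of this sketch.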

Gene prediction and annotation

The raw survey data and transcriptome data (Hua et al. 2017) were used to predict and annotate genes. After filtering out scaffolds < 1000 bp in size, GENSCAN, a program that identifies complete exon/intron structures of genes in genomic DNA, was applied for gene identification with parameters trained on D. zingiberensis (Burge and Karlin 1997). Additionally, TransDecoder v2.0 (Haas et al. 2013) and GeneMarkS-T v5.1 (Tang et al. 2015) were used to predict genes from the transcriptome database. Each predicted gene was annotated by BLAST alignment to the GenBank database and then compared against common databases, such as plant Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), Eukaryotic Clusters of Orthologous Groups (KOG), Nr, TrEMBL, and Clusters of Orthologous Groups (COG). The annotated genes were then classified into GO categories and mapped onto the KEGG reference pathways (Hirakawa et al. 2015).

Results

Genome size estimation by flow cytometry

The flow cytometric analyses yielded a high-resolution histogram with CVs and mean values of the tetraploid D. zingiberensis and the internal standard Salvia miltiorrhiza (Fig. 1). The CVs were 0.80% and 2.01% for D. zingiberensis and Salvia miltiorrhiza, respectively. The peak ratio was calculated as 2.38, meaning the 1C-value of D. zingiberensis was 0.87 pg (1.32 × 0.66 pg = 0.87 pg).
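The arithmetic behind the 1C-value can be written out as a short sketch. It uses the factor 1.32 and the reference value 0.66 pg from the calculation above, and it assumes the widely used conversion of 1 pg of DNA to roughly 978 Mb, which is not stated in the text. The function names are hypothetical.

```python
# Assumed conversion factor: 1 pg of DNA corresponds to about 978 Mb.
PG_TO_MB = 978.0


def c_value_pg(sample_to_standard_ratio, standard_pg):
    """1C-value of the sample from its fluorescence ratio to the internal standard."""
    return sample_to_standard_ratio * standard_pg


def pg_to_mb(picograms):
    """Convert a DNA mass in picograms to megabases."""
    return picograms * PG_TO_MB


c_value = c_value_pg(1.32, 0.66)
# Round to two decimals first, as in the reported value of 0.87 pg.
print(f"{c_value:.2f} pg = {pg_to_mb(round(c_value, 2)):.0f} Mb")  # 0.87 pg = 851 Mb
```

Under these assumptions the result matches the 0.87 pg (851 Mb) reported in the abstract.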

Genome sequencing and sequence assembly

A total of 31.48 Gb of sequence data were generated from the small-insert (220 bp) library, with 91.1% Q30 bases (base quality > 30), as required for successful assembly, and approximately 78.70× coverage (Table 1). A large contig N50 and contig number might simply reflect a continuous and complete assembly (Li et al. 2012). The 31.48 Gb of clean reads were used for de novo assembly. K-mer sizes of 20, 37, 55, 63, 71, 77, 83, and 95 were examined using default parameters. The SOAPdenovo assembly with a k-mer size of 63 was selected because it gave the best N50 (Table 2). N50 is a weighted median: it is the length of the shortest contig in the set of longest contigs whose combined length reaches 50% of the total assembly length (Carneiro et al. 2012; Earl et al. 2011). SOAPdenovo with a k-mer size of 63 produced contigs with an N50 of ~1.08 kb, a longest contig of ~19.83 kb, and a total length of ~170.54 Mb (Table 2). Scaffolds with an N50 of ~1.96 kb, a total length of ~173.46 Mb, and a longest scaffold of ~40.16 kb (Table 2) were also generated. The total gap length (Ns) was ~2.91 Mb.
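The N50 definition given above translates directly into a few lines of code. The following is a minimal sketch with a hypothetical function name and toy contig lengths; it is not the routine used by the assembler.

```python
def n50(lengths):
    """Length of the shortest contig among the longest contigs that together
    cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


# Toy example: total length 54, so the target is 27; 10 + 9 + 8 reaches it.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
```

Because the statistic depends only on the sorted lengths, the same function applies to contigs and scaffolds alike.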

Genome size estimation

For the k-mer analysis, a total of 27.24 Gb of clean data (Table 3), obtained after filtering out the chloroplast sequences, were used to count and plot the 17-mer frequency distribution and estimate the genome size of D. zingiberensis. In the 17-mer frequency distribution (Figure 1), the average k-mer depth and the main peak were at ~57×. The repeat peak was located at an integer multiple of the main peak depth (~114×). The heterozygous peak appeared at half the depth of the main peak (~28×), whereas a minor peak clearly appeared at a quarter of the depth of the main peak (~15×). Thus, the sample was suspected to be autotetraploid. According to the genetic background of this species, both diploids and tetraploids occur in nature (Huang et al. 2010). We therefore deduced the sample to be autotetraploid. As a result, we estimated the genome size to be 800.00 Mb, calculated with the following formula: Genome size = K-mer number / Peak depth. Repetitive sequences accounted for approximately 42.81% of the D. zingiberensis genome, an estimated 342.48 Mb. The heterozygosity rate was approximately 1.37%, which places D. zingiberensis among the complex genomes with high heterozygosity.
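The formula Genome size = K-mer number / Peak depth can be sketched on toy data as follows. This is an illustration only: it counts k-mers in memory rather than with SOAPec, the function names and the `min_depth` cut-off for ignoring low-depth (likely erroneous) k-mers are assumptions, and it simply takes the tallest peak as the main peak, which for a heterozygous or polyploid sample is a choice that needs to be checked against the depth plot.

```python
from collections import Counter


def count_kmers(reads, k=17):
    """Count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def estimate_genome_size(kmer_counts, min_depth=5):
    """Return (estimated genome size, peak depth) from k-mer counts."""
    # Depth histogram: depth -> number of distinct k-mers observed at that depth.
    histogram = Counter(kmer_counts.values())
    candidates = {d: n for d, n in histogram.items() if d >= min_depth}
    if not candidates:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(candidates, key=candidates.get)
    total_kmers = sum(kmer_counts.values())
    return total_kmers / peak_depth, peak_depth
```

For example, if every 50-bp window of a 1,000-bp sequence without repeated 17-mers is used as a read, the interior 17-mers are each seen 34 times, and the estimate comes out slightly below 1,000 bp because k-mers near the ends are covered less deeply.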

Guanine plus cytosine (GC) content analysis

To measure genome-wide sequencing bias, the GC content and average sequencing depth were plotted using non-overlapping 10-kb sliding windows along the assembled sequence (Figure 2). The GC content of the genome varies among plant species. A GC content that is too high (>65%) or too low (<25%) may cause sequence bias on the Illumina sequencing platform and thus seriously affect genome assembly (Aird et al. 2011). The average GC content of the D. zingiberensis genome was 39.12% (Table 2), which was higher than that of Arabidopsis thaliana (36%) (Barakat et al. 1998) and potato (34.8-36.0%) (Consortium et al. 2011; Hirakawa et al. 2015) but lower than that of some marine macroalgae, such as Cyanidioschyzon merolae (55.0%) (Ohta et al. 2003), Solieria filiformis (48.6%) (Dalmon and Loiseaux 1981), Chondrus crispus (46.3%) (Gall et al. 1993), and Laminaria hyperborea (42.6%) (Stam et al. 1988). Therefore, the D. zingiberensis genome has a mid-range GC content. Moreover, the GC-depth plot was split slightly into four layers (Figure 2), which was caused in part by polyploidy and the high heterozygosity rate (1.37%).
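The windowed GC calculation can be sketched as below. The function name is hypothetical, and the handling of gap bases (N is excluded from the denominator) is an assumption rather than something described in the text. The per-window sequencing depth would additionally require read alignments and is not covered here.

```python
def gc_by_window(sequence, window=10_000):
    """GC fraction in consecutive non-overlapping windows.

    Bases other than A, C, G, and T (e.g. N in scaffold gaps) are ignored;
    a window with no called bases yields None.
    """
    fractions = []
    for start in range(0, len(sequence), window):
        chunk = sequence[start:start + window].upper()
        called = sum(chunk.count(base) for base in "ACGT")
        gc = chunk.count("G") + chunk.count("C")
        fractions.append(gc / called if called else None)
    return fractions


print(gc_by_window("GCGCATAT", window=4))  # [1.0, 0.0]
```

Plotting these fractions against the mean depth of the same windows gives a GC-depth plot of the kind referred to in Figure 2.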

SSR identification

A total of 3,548,310 sequences from the genome survey were examined, and they contained 419,372 SSRs (Table 4). Mononucleotide repeats were the predominant type, accounting for 54.60% of the observed SSRs, followed by di- (29.60%), tri- (11.37%), tetra- (3.53%), penta- (0.65%), and hexanucleotide (0.25%) repeat types (Table 4). Mononucleotide repeats have been reported to be the most common type of repeat in both monocot species, such as rice, sorghum, and Brachypodium, and dicot species, such as Arabidopsis, Medicago, and Populus, reaching as much as 79% in Medicago (Sonah et al. 2011).

In addition, 354 motif types were identified, comprising mono- (4), di- (8), tri- (30), tetra- (86), penta- (130), and hexanucleotide (96) types (Table S1). Among the dinucleotide repeat motifs, the TA/TA and AT/AT repeats were the two most abundant types, accounting for 50.44% and 48.77%, respectively, followed by CT/AG repeats (19.21%) (Figure 3). The predominant trinucleotide motifs were ATT/AAT and TAA/TTA, accounting for 22.25% and 13.92%, respectively (Figure 4).

Gene prediction and annotation

Based on the combination of genome sequencing and transcriptome data analysis of D. zingiberensis, a total of 27,057 genes were predicted by GENSCAN and EVM (Altschul et al. 1990) (Table 5). The average length of the putative genes was 911 bp, and the average exon and intron lengths were 544 bp and 236 bp, respectively. The putative genes were aligned by BLAST to the NR (Marchler-Bauer et al. 2011), KOG (Tatusov et al. 2001), GO (Dimmer et al. 2012), KEGG (Kanehisa and Goto 2000), and TrEMBL (Boeckmann et al. 2003) databases, and 86.78% of the putative genes were matched (Table 6).

GO annotations for the putative genes were obtained using the Blast2GO program (Conesa et al. 2005). WEGO software (Ye et al. 2006) was then used to perform GO functional classification for all genes and to describe the distribution of gene functions in D. zingiberensis at the macro level. A total of 12,736 genes were identified by the GO slim analysis and further classified into the categories of molecular function, cellular component, and biological process (Figure 5). Specifically, 35.79%, 19.82%, and 44.39% of the genes were grouped under cellular components, molecular functions, and biological processes, respectively. Furthermore, cell and cell part (24.37% and 24.59%, respectively) were the most highly represented groups within cellular components; catalytic activity (47.18%) represented a relatively high proportion within molecular functions; and metabolic process (20.66%) was the most highly represented group within biological processes.

Altogether, 13,432 putative genes were classified into KOG functional categories. The largest group was the cluster for general function prediction only (3,306; 24.61%), followed by signal transduction mechanisms (1,567; 11.67%) and posttranslational modification, protein turnover, and chaperones (1,269; 9.45%) (Figure S1).

Pathway assignments were made according to KEGG mapping (Kanehisa and Goto 2000). In total, 7,406 putative genes were assigned to 127 KEGG pathways (Table S3). A total of 4,473 genes (60.40%) were associated with 96 metabolic pathways, of which 1,174 (26.25%) were involved in carbohydrate metabolism. The next largest category was genetic information processing (1,945; 26.26%), followed by environmental information processing (382), cellular processes (350), and organismal systems (251).

Discussion

Genome size, also known as the genomic content or DNA 1C-value, refers to the DNA content of the gamete genome. Genome size is the basis for comparative and evolutionary genomics (Hirano and Das 2012). By comparing the genome sizes of different species, we can detect and understand the patterns of genome variation. Flow cytometry has been regarded as a standard method for estimating genome size, for example in Lessingianthus (Vernonieae, Asteraceae) (Angulo and Dematteis 2013), Capsicum (Solanaceae) (Moscone et al. 2003), and Bemisia tabaci (Aleyrodidae) (Brown et al. 2005). In addition, Feulgen spectrophotometry (Ha et al. 2007) and pulsed-field gel electrophoresis (PFGE) (Lingohr et al. 2009; Zhang and Fan 2002) have proven to be effective methods for determining genome size. However, the development of NGS technologies has provided researchers with a more efficient and affordable approach to addressing a wide range of questions relating to non-model species. This approach has been applied to the analysis of the genomes of Brassica juncea (Yang et al. 2014), Myrica rubra (Jiao et al. 2012), and Gracilariopsis lemaneiformis (Zhou et al. 2013).

Among the 200-250 species in the genus Dioscorea, D. zingiberensis is the most important because of its high content of medicinally valuable dioscin. However, the limited genomic information available has constrained genetic studies on D. zingiberensis. This article offers a brief description of the genome size of D. zingiberensis, providing a reference for genome size variation within the genus Dioscorea. This is the first report of genome-wide characterization in the genus Dioscorea, and it evaluates the heterozygosity rate, GC content, and GC distribution of the genome. The main conclusions of the study are as follows.

(1) The genome size, heterozygosity rate, and GC content of D. zingiberensis are approximately 800 Mb, 1.37%, and 39.12%, respectively, as estimated from the k-mer depth distribution of the sequenced reads. Compared with other species in Dioscorea, the genome of D. zingiberensis is relatively small (Table 7). Genome size clearly varies among species in this genus. We speculate that this reflects differing degrees of repeat sequence amplification in different species (Ohri and Khoshoo 1986; Wakamiya et al. 1993) and possible hybridization between closely related taxa (Hall et al. 2000). The most likely explanation is that frequent polyploidization events drove genome size variation during the evolution of the genus (Tian et al. 2008). Furthermore, the flow cytometry result is consistent with the genome survey.

Arabidopsis has been documented as a model organism for genetic study, mainly because it has a small genome (120 Mb) that is amenable to detailed molecular analysis (Meinke et al. 1998; Meyerowitz and Pruitt 1985). Among marine macroalgae, Pyropia has been regarded as a model organism for genetic studies, partly because its haploid phase has a relatively small genome (270–530 Mb) (Yang et al. 2011). The estimated genome size of D. zingiberensis (800 Mb) is much smaller than that of other Dioscorea species, which shows the potential of D. zingiberensis as a model species in this regard.

(2) The genome survey identified a total of 419,372 SSRs in the D. zingiberensis genome. SSRs have been surveyed in the genomes of many plant species, and their numbers differ considerably among the sequenced plants, such as Oryza sativa (70,531), Arabidopsis thaliana (15,249), and Sorghum bicolor (73,658) (Sonah et al. 2011). The number of SSRs in D. zingiberensis was nearly six times that in Oryza sativa. The frequency of each of the 354 SSR motifs is presented in Table S1. The TA/TA repeat was the most prominent type, accounting for 50.44%. The 419,372 SSR loci identified in our study may serve as SSR markers for genetic mapping in the near term.

(3) The number of genes predicted from the genome survey of D. zingiberensis alone was much lower than that of other sequenced genomes, such as Rosa roxburghii Tratt. (Lu et al. 2016), Prunus mume (Zhang et al. 2012), and Prunus persica (Verde et al. 2013). The likely reasons are the insufficient sequencing depth and the low sequence homology resulting from the limited gene information available for closely related species (Zhou et al. 2013). However, after the survey data were combined with the transcriptome data, the number of annotated genes increased significantly.

(4) Genomes sequenced with the whole-genome shotgun (WGS) strategy are relatively difficult to assemble (Chen and Pachter 2005). The more complicated bacterial artificial chromosome (BAC) strategy could instead resolve problems associated with the assembly of a heterozygous genome (Wu et al. 2013). However, to ensure the integrity of the assembly, homozygous materials have always been the priority for genome sequencing (Shulaev et al. 2011; Xu et al. 2013).


Author Contributions

Conceived and designed the experiments: WZ BL. Performed the experiments: YCL WM. Analysed the data: WZ LL. Contributed reagents/materials/analysis tools: SCF YCL. Wrote the paper: WZ BL.

Data Availability Statement: The genome sequence reads obtained with the Illumina HiSeq 2500 are available at the NCBI SRA. The BioProject accession number is PRJNA391240 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA391240), and the BioSample accession number is SAMN07259770 (https://www.ncbi.nlm.nih.gov/biosample/SAMN07259770). The Experiment number is SRX3235157 (Dioscorea zingiberensis), and the Run number is SRR6122503.

Funding: This work was supported by the Fundamental Research Funds for the Central Universities (2017CSZ008) and the National Natural Science Foundation of China (31670299).


References

Aird, D., Ross, M.G., Chen, W.S., Danielsson, M., Fennell, T., Russ, C., Jaffe, D.B., Nusbaum, C., and Gnirke, A. 2011. Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biology 12(2): 1-14.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. Journal of Molecular Biology 215(3): 403-410.
Angulo, M.B., and Dematteis, M. 2013. Nuclear DNA content in some species of Lessingianthus (Vernonieae, Asteraceae) by flow cytometry. J Plant Res 126(4): 461-468. doi: 10.1007/s10265-012-0539-x.
Arumuganathan, K., and Earle, E.D. 1991. Nuclear DNA content of some important plant species. Plant Molecular Biology Reporter 9(3): 208-218.
Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.C., Estreicher, A., Gasteiger, E., Martin, M.J., Michoud, K., O'Donovan, C., Phan, I., Pilbout, S., and Schneider, M. 2003. The Swiss-Prot knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Research 31(1): 365.
Bai, Y., Zhang, L., Jin, W., Wei, M., Zhou, P., Zheng, G., Niu, L., Nie, L., Zhang, Y., and Wang, H. 2015. In situ high-valued utilization and transformation of sugars from Dioscorea zingiberensis C.H. Wright for clean production of diosgenin. Bioresource Technology 196: 642.
Barakat, A., Matassi, G., and Bernardi, G. 1998. Distribution of Genes in the Genome of Arabidopsis thaliana and Its Implications for the Genome Organization of Plants. Proceedings of the National Academy of Sciences of the United States of America 95(17): 10044-10049.
Bertrand, J., Liagre, B., Bégaud-Grimaud, G., Jauberteau, M.O., Beneytout, J.L., Cardot, P.J.P., and Battu, S. 2009. Analysis of relationship between cell cycle stage and apoptosis induction in K562 cells by sedimentation field-flow fractionation. Journal of Chromatography B Analytical Technologies in the Biomedical & Life Sciences 877(11-12): 1155.
Bharathan, G., Lambert, G., and Galbraith, D.W. 1994. Nuclear DNA Content of Monocotyledons and Related Taxa. American Journal of Botany 81(3): 381-386.
Brown, J.K., Lambert, G.M., Ghanim, M., Czosnek, H., and Galbraith, D.W. 2005. Nuclear DNA content of the whitefly Bemisia tabaci (Aleyrodidae: Hemiptera) estimated by flow cytometry. Bulletin of Entomological Research 95(4): 309-312.
Burge, C., and Karlin, S. 1997. Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology 268(1): 78-94.
Carneiro, A.R., Ramos, R.T., Barbosa, H.P., Schneider, M.P., Barh, D., Azevedo, V., and Silva, A. 2012. Quality of prokaryote genome assembly: Indispensable issues of factors affecting prokaryote genome assembly quality. Gene 505(2): 365.
Chen, K., and Pachter, L. 2005. Bioinformatics for Whole-Genome Shotgun Sequencing of Microbial Communities. PLoS Comput. Biol. 1(2): 106.
Chen, S., Xu, J., Liu, C., Zhu, Y., Nelson, D.R., Zhou, S., Li, C., Wang, L., Guo, X., and Sun, Y. 2012. Genome sequence of the model medicinal mushroom Ganoderma lucidum. Nature Communications 3(2): 913.
Conesa, A., Götz, S., García-Gómez, J.M., Terol, J., Talón, M., and Robles, M. 2005. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 21(18): 3674-3676.
Consortium, P.G.S., Xu, X., Pan, S., Cheng, S., Zhang, B., Mu, D., Ni, P., Zhang, G., Yang, S., and Li, R. 2011. Genome sequence and analysis of tuber crop potato. Nature 475(7355): 189-195.
Dalmon, J., and Loiseaux, S. 1981. The deoxyribonucleic acids of two brown algae: Pylaiella littoralis (L.) Kjellm. and Sphacellaria sp. Plant Science Letters 21(3): 241-251.


Dimmer, E.C., Huntley, R.P., Alam-Faruque, Y., Sawford, T., O'Donovan, C., Martin, M.J., Bely, B., Browne, P., Chan, W.M., Eberhardt, R., Gardner, M., Laiho, K., Legge, D., Magrane, M., Pichler, K., Poggioli, D., Sehra, H., Auchincloss, A., Axelsen, K., Blatter, M.C., Boutet, E., Braconi-Quintaje, S., Breuza, L., Bridge, A., Coudert, E., Estreicher, A., Famiglietti, L., Ferro-Rojas, S., Feuermann, M., Gos, A., Gruaz-Gumowski, N., Hinz, U., Hulo, C., James, J., Jimenez, S., Jungo, F., Keller, G., Lemercier, P., Lieberherr, D., Masson, P., Moinat, M., Pedruzzi, I., Poux, S., Rivoire, C., Roechert, B., Schneider, M., Stutz, A., Sundaram, S., Tognolli, M., Bougueleret, L., Argoud-Puy, G., Cusin, I., Duek-Roggli, P., Xenarios, I., and Apweiler, R. 2012. The UniProt-GO Annotation database in 2011. Nucleic Acids Research 40(D1): D565-D570. doi: 10.1093/nar/gkr1048.
Doležel, J., Binarová, P., and Lucretti, S. 1989. Analysis of Nuclear DNA content in plant cells by Flow cytometry. Biologia Plantarum 31(2): 113-120.
Earl, D., Bradnam, K., St, J.J., Darling, A., Lin, D., Fass, J., Yu, H.O., Buffalo, V., Zerbino, D.R., and Diekhans, M. 2011. Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Research 21(12): 2224.
Gall, Y.L., Brown, S., Marie, D., Mejjad, M., and Kloareg, B. 1993. Quantification of nuclear DNA and GC content in marine macroalgae by flow cytometry of isolated nuclei. Protoplasma 173(3): 123-132.
Ha, S.H., Kim, J.B., Park, J.S., Lee, S.W., and Cho, K.J. 2007. A comparison of the carotenoid accumulation in Capsicum varieties that show different ripening colours: deletion of the capsanthin-capsorubin synthase gene is not a prerequisite for the formation of a yellow pepper.
Haas, B.J., Papanicolaou, A., Yassour, M., Grabherr, M., Blood, P.D., Bowden, J., Couger, M.B., Eccles, D., Li, B., and Lieber, M. 2013. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nature Protocols 8(8): 1494-1512.
Hall, S.E., Dvorak, W.S., Johnston, J.S., Price, H.J., and Williams, C.G. 2000. Flow Cytometric Analysis of DNA Content for Tropical and Temperate New World Pines. Annals of Botany 86(6): 1081-1086.
Hamon, P., Brizard, J.P., Zoundjihékpon, J., Duperray, C., and Borgel, A. 2011. Étude des index d'ADN de huit espèces d'ignames (Dioscorea sp.) par cy. Canadian Journal of Botany 70(5): 996-1000.
Hirakawa, H., Okada, Y., Tabuchi, H., Shirasawa, K., Watanabe, A., Tsuruoka, H., Minami, C., Nakayama, S., Sasamoto, S., and Kohara, M. 2015. Survey of genome sequences in a wild sweet potato, Ipomoea trifida (H. B. K.) G. Don. DNA Research 22(2): 171-179.
Hirano, M., and Das, S. 2012. Editorial [Hot Topic: Comparative Genomics and Genome Evolution (Guest Editors: Sabyasachi Das and Masayuki Hirano)]. Current Genomics 13(2): .
Hou, S., Sun, Z., Linghu, B., Xu, D., Wu, B., Zhang, B., Wang, X., Han, Y., Zhang, L., and Qiao, Z. 2016. Genetic Diversity of Buckwheat Cultivars (Fagopyrum tartaricum Gaertn.) Assessed with SSR Markers Developed from Genome Survey Sequences. Plant Molecular Biology Reporter 34(1): 233-241.
Hua, W., Kong, W., Cao, X.Y., Chen, C., Liu, Q., Li, X., and Wang, Z. 2017. Transcriptome analysis of Dioscorea zingiberensis identifies genes involved in diosgenin biosynthesis. Genes & Genomics 39(5): 1-12.
Huang, H.P., Gao, S.L., Chen, L.L., and Wei, K.H. 2010. In vitro tetraploid induction and generation of tetraploids from mixoploids in Dioscorea zingiberensis. Pharmacognosy Magazine 6(21): 51-56.
Jiao, Y., Jia, H.M., Li, X.W., Chai, M.L., Jia, H.J., Chen, Z., Wang, G.Y., Chai, C.Y., Weg, E.V.D., and Gao, Z.S. 2012. Development of simple sequence repeat (SSR) markers from a genome survey of Chinese bayberry (Myrica rubra). BMC Genomics 13(1): 201.
Kanehisa, M., and Goto, S. 2000. KEGG: Kyoto Encyclopaedia of Genes and Genomes. Nucleic Acids Research 28(1): 27-30.
Li, D., Zhi, D., Bi, Q., Liu, X., Men, and Zhonghua. 2012. De novo assembly and characterization of bark transcriptome using Illumina sequencing and development of EST-SSR markers in rubber tree (Hevea

https://mc06.manuscriptcentral.com/genome-pubs Genome Page 18 of 34

393 brasiliensis Muell. Arg.). BMC Genomics 13(1): 192. 394 Li, H., Huang, W., Wen, Y., Gong, G., Zhao, Q., and Yu, G. 2010a. Antithrombotic activity and chemical 395 characterization of steroidal saponins from Dioscorea zingiberensis C.H. Wright. Fitoterapia 81(8): 396 11471156. doi: 10.1016/j.fitote.2010.07.016. 397 Li, R., Fan, W., Tian, G., Zhu, H., He, L., Cai, J., Huang, Q., Cai, Q., Li, B., Bai, Y., Zhang, Z., Zhang, Y., 398 Wang, W., Li, J., Wei, F., Li, H., Jian, M., Li, J., Zhang, Z., Nielsen, R., Li, D., Gu, W., Yang, Z., Xuan, Z., 399 Ryder, O.A., Leung, F.C., Zhou, Y., Cao, J., Sun, X., Fu, Y., Fang, X., Guo, X., Wang, B., Hou, R., Shen, F., 400 Mu, B., Ni, P., Lin, R., Qian, W., Wang, G., Yu, C., Nie, W., Wang, J., Wu, Z., Liang, H., Min, J., Wu, Q., 401 Cheng, S., Ruan, J., Wang, M., Shi, Z., Wen, M., Liu, B., Ren, X., Zheng, H., Dong, D., Cook, K., Shan, G., 402 Zhang, H., Kosiol, C., Xie, X., Lu, Z., Zheng, H., Li, Y., Steiner, C.C., Lam, T.T., Lin, S., Zhang, Q., Li, G., 403 Tian, J., Gong, T., Liu, H., Zhang, D., Fang, L., Ye, C., Zhang, J., Hu, W., Xu, A., Ren, Y., Zhang, G., 404 Bruford, M.W., Li, Q., Ma, L., Guo, Y., An, N., Hu, Y., Zheng, Y., Shi, Y., Li, Z., Liu, Q., Chen, Y., Zhao, J., 405 Qu, N., Zhao, S., Tian, F., Wang, X., Wang, H., Xu, L., Liu, X., Vinar, T., Wang, Y., Lam, T.W., Yiu, S.M., 406 Liu, S., Zhang, H., Li, D., Huang, Y., Wang, X., Yang, G., Jiang, Z., Wang, J., Qin, N., Li, L., Li, J., Bolund, 407 L., Kristiansen, K., Wong, G.K., Olson, M., Zhang, X., Li, S., Yang, H., Wang, J., and Wang, J. 2010b. The 408 sequence and de novo assembly of the giant panda genome. Nature 463(7279): 311317. doi: 409 10.1038/nature08696. 410 Li, X., and Waterman, M.S. 2003. Estimating the Repeat Structure and Length of DNA Sequences Using 411 ℓTuples. Genome Research 13(8): 1916. 412 Liang, Xiao, Wang, Yang, Tian, Jinmin, Lian, Ruijuan, Yang, and Shumei. 2015. 
The Genome of 413 Dendrobium officinale Illuminates the BiologyDraft of the Important Traditional Chinese Orchid Herb. 分子植 414 物(英文版) 8(6): 922934. 415 Lingohr, E., Frost, S., and Johnson, R.P. 2009. Determination of Bacteriophage Genome Size by 416 PulsedField Gel Electrophoresis. Humana Press. 417 Lu, M., An, H., and Li, L. 2016. Genome Survey Sequencing for the Characterization of the Genetic 418 Background of Rosa roxburghii Tratt and Ascorbate Metabolism Genes. Plos One 11(2): e0147530. 419 Marchlerbauer, A., Lu, S., Anderson, J.B., Chitsaz, F., Derbyshire, M.K., Deweesescott, C., Fong, J.H., Geer, 420 L.Y., Geer, R.C., and Gonzales, N.R. 2011. CDD: a Conserved Domain Database for the functional 421 annotation of proteins. Nucleic Acids Research 39(Database issue): D225. 422 Meinke, D.W., Cherry, J.M., Dean, C., Rounsley, S.D., and Koornneef, M. 1998. Arabidopsis thaliana: a 423 model plant for genome analysis. Science 282(5389): 679682. 424 Meyerowitz, E.M., and Pruitt, R.E. 1985. Arabidopsis thaliana and Plant Molecular Genetics. Science 425 229(4719): 12141218. doi: 10.1126/science.229.4719.1214. 426 Minato, D., Li, B., Zhou, D., Shigeta, Y., Toyooka, N., Sakurai, H., Sugimoto, K., Nemoto, H., and Matsuya, 427 Y. 2013. Synthesis and antitumor activity of desAB analogue of steroidal saponin OSW1. Tetrahedron 428 69(37): 80198024. doi: 10.1016/j.tet.2013.06.105. 429 Moscone, E.A., Baranyi, M., Ebert, I., Greilhuber, J., Ehrendorfer, F., and Hunziker, A.T. 2003. Analysis of 430 Nuclear DNA Content in Capsicum (Solanaceae) by Flow Cytometry and Feulgen Densitometry. Annals of 431 Botany 92(1): 21. 432 Obidiegwu, J.E., Rodriguez, E., EneObong, E.E., Loureiro, J., Muoneke, C.O., Santos, C., 433 KolesnikovaAllen, M., and Asiedu, R. 2009. Estimation of the nuclear DNA content in some representative 434 of genus Dioscorea. Scientific Research & Essays 4(5): 448452. 435 Ohri, D., and Khoshoo, T.N. 1986. Genome size in gymnosperms. 
Plant Systematics and Evolution 153(1): 436 119132. 437 Ohta, N., Matsuzaki, M., Misumi, O., Miyagishima, S.Y., Nozaki, H., Kan, T., ShinI, T., Kohara, Y., and 18

https://mc06.manuscriptcentral.com/genome-pubs Page 19 of 34 Genome

438 Kuroiwa, T. 2003. Complete Sequence and Analysis of the Plastid Genome of the Unicellular Red Alga 439 Cyanidioschyzon merolae. DNA Research 10(2): 67. 440 Parker, S.C.J., Margulies, E.H., and Tullius, T.D. 2008. THE RELATIONSHIP BETWEEN FINE SCALE 441 DNA STRUCTURE, GC CONTENT, AND FUNCTIONAL ELEMENTS IN 1% OF THE HUMAN 442 GENOME. In. p. 199. 443 Qin, Y., Wu, X., Huang, W., Gong, G., Li, D., He, Y., and Zhao, Y. 2009. Acute toxicity and subchronic 444 toxicity of steroidal saponins from Dioscorea zingiberensis C.H.Wright in rodents. Journal of 445 Ethnopharmacology 126(3): 543550. 446 Shulaev, V., Sargent, D.J., Crowhurst, R.N., Mockler, T.C., Folkerts, O., Delcher, A.L., Jaiswal, P., Mockaitis, 447 K., Liston, A., Mane, S.P., Burns, P., Davis, T.M., Slovin, J.P., Bassil, N., Hellens, R.P., Evans, C., Harkins, 448 T., Kodira, C., Desany, B., Crasta, O.R., Jensen, R.V., Allan, A.C., Michael, T.P., Setubal, J.C., Celton, J.M., 449 Rees, D.J.G., Williams, K.P., Holt, S.H., Rojas, J.J.R., Chatterjee, M., Liu, B., Silva, H., Meisel, L., Adato, 450 A., Filichkin, S.A., Troggio, M., Viola, R., Ashman, T.L., Wang, H., Dharmawardhana, P., Elser, J., Raja, R., 451 Priest, H.D., Bryant, D.W., Fox, S.E., Givan, S.A., Wilhelm, L.J., Naithani, S., Christoffels, A., Salama, D.Y., 452 Carter, J., Girona, E.L., Zdepski, A., Wang, W.Q., Kerstetter, R.A., Schwab, W., Korban, S.S., Davik, J., 453 Monfort, A., DenoyesRothan, B., Arus, P., Mittler, R., Flinn, B., Aharoni, A., Bennetzen, J.L., Salzberg, 454 S.L., Dickerman, A.W., Velasco, R., Borodovsky, M., Veilleux, R.E., and Folta, K.M. 2011. The genome of 455 woodland strawberry (Fragaria vesca). Nat Genet 43(2): 109116. doi: 10.1038/ng.740. 456 Sonah, H., Deshmukh, R.K., Sharma, A., Singh, V.P., Gupta, D.K., Gacche, R.N., Rana, J.C., Singh, N.K., 457 and Sharma, T.R. 2011. GenomeWide Distribution and Organization of Microsatellites in Plants: An Insight 458 into Marker Development in Brachypodium.Draft Plos One 6(6). 
doi: ARTN e21298 459 10.1371/journal.pone.0021298. 460 Stam, W.T., Bot, P.V.M., BoeleBos, S.A., Rooij, J.M.V., and Hoek, C.V.D. 1988. Singlecopy DNADNA 461 hybridizations among five species ofLaminaria (Phaeophyceae): Phylogenetic and biogeographic 462 implications. Helgoland Marine Research 42(2): 251267. 463 Tang, S., Lomsadze, A., and Borodovsky, M. 2015. Identification of protein coding regions in RNA 464 transcripts. Nucleic Acids Research 43(12): e78. 465 Tatusov, R.L., Natale, D.A., Garkavtsev, I.V., Tatusova, T.A., Shankavaram, U.T., Rao, B.S., Kiryutin, B., 466 Galperin, M.Y., Fedorova, N.D., and Koonin, E.V. 2001. The COG database: new developments in 467 phylogenetic classification of proteins from complete genomes. Nucleic Acids Research 29(1): 22. 468 Thiel, T., Michalek, W., Varshney, R.K., and Graner, A. 2003. Exploiting EST databases for the development 469 and characterization of genederived SSRmarkers in barley (Hordeum vulgare L.). Theoretical and Applied 470 Genetics 106(3): 411422. 471 Tian, M., JiYuan, L.I., Sui, N.I., Fan, Z.Q., XinLei, and Li. 2008. Phylogenetic Study on Section Camellia 472 Based on ITS Sequences Data. Acta Horticulturae Sinica 35(11): 16851688. 473 Varshney, R.K., Chen, W., Li, Y., Bharti, A.K., Saxena, R.K., Schlueter, J.A., Donoghue, M.T., Azam, S., 474 Fan, G., and Whaley, A.M. 2012. Draft genome sequence of pigeonpea (Cajanus cajan), an orphan legume 475 crop of resourcepoor farmers. Nature Biotechnology 30(1): 8389. 476 Verde, I., Abbott, A.G., Scalabrin, S., Jung, S., Shu, S., Marroni, F., Zhebentyayeva, T., Dettori, M.T., and 477 Grimwood, J. 2013. The highquality draft genome of peach (Prunus persica) identifies unique patterns of 478 genetic diversity, domestication and genome evolution. Nat Genet 45(5): 487494. 479 Veselý, P., Bureš, P., Šmarda, P., and Pavlíček, T. 2012. Genome size and DNA base composition of 480 geophytes: the mirror of phenology and ecology? Annals of botany 109(1): 65. 
481 Wakamiya, I., Newton, R.J., Johnston, J.S., and Price, H.J. 1993. Genome Size and Environmental Factors in 482 the Genus Pinus. American Journal of Botany 80(11): 12351241. 19

https://mc06.manuscriptcentral.com/genome-pubs Genome Page 20 of 34

483 Wang, Y., Zhang, Y., Zhu, Z., Zhu, S., Li, Y., Li, M., and Yu, B. 2007. Exploration of the correlation between 484 the structure, hemolytic activity, and cytotoxicity of steroid saponins. Bioorganic & Medicinal Chemistry 485 15(7): 2528. 486 Wenping, H., Yuan, Z., Jie, S., Lijun, Z., and Zhezhi, W. 2011. De novo transcriptome sequencing in Salvia 487 miltiorrhiza to identify genes involved in the biosynthesis of active ingredients. Genomics 98(4): 272. 488 Wu, J., Wang, Z., Shi, Z., Zhang, S., Ming, R., Zhu, S., Khan, M.A., Tao, S., Korban, S.S., and Wang, H. 489 2013. The genome of the pear (Pyrus bretschneideri Rehd.). Genome Research 23(2): 396. 490 Xu, Q., Chen, L.L., Ruan, X., Chen, D., Zhu, A., Chen, C., Bertrand, D., Jiao, W.B., Hao, B.H., and Lyon, 491 M.P. 2013. The draft genome of sweet orange (Citrus sinensis). Nat Genet 45(1): 59. 492 YANG, Hui, and MAO. 2011. Profiling of the transcriptome of Porphyra yezoensis with Solexa sequencing 493 technology. Science Bulletin 56(20): 21192130. 494 Yang, J., Ning, S., Xuan, Z., Qi, X., Hu, Z., and Zhang, M. 2014. Genome survey sequencing provides clues 495 into glucosinolate biosynthesis and flowering pathway evolution in allotetrapolyploid Brassica juncea. BMC 496 Genomics 15(1): 107. 497 Ye, J., Fang, L., Zheng, H., Zhang, Y., Chen, J., Zhang, Z., Wang, J., Li, S., Li, R., and Bolund, L. 2006. 498 WEGO: a web tool for plotting GO annotations. Nucleic Acids Research 34(Web Server issue): 293297. 499 Zhang, G., Yang, T., Jing, Z., Shu, L., Yang, S., Wen, W., Sheng, J., Yang, D., and Wei, C. 2015. Hybrid de 500 novo genome assembly of the Chinese herbal plant danshen ( Salvia miltiorrhiza Bunge). 501 GigaScience,4,1(20151214) 4(1): 62. 502 Zhang, J.Z., and Fan, M.Y. 2002. Determination of genome size and restriction fragment length 503 polymorphism of four Chinese rickettsial isolatesDraft by pulsedfield gel electrophoresis. Acta Virologica 46(1): 504 2530. 
505 Zhang, Q., Chen, W., Sun, L., Zhao, F., Huang, B., Yang, W., Tao, Y., Wang, J., Yuan, Z., and Fan, G. 2012. 506 The genome ofPrunus mume. Nature Communications 3(4): 1318. 507 Zhou, W., Hu, Y., Sui, Z., Fu, F., Wang, J., Chang, L., Guo, W., and Li, B. 2013. Genome Survey Sequencing 508 and Genetic Background Characterization of Gracilariopsis lemaneiformis (Rhodophyta) Based on 509 NextGeneration Sequencing. Plos One 8(7): e69909. 510 Zonneveld, B.J., Leitch, I.J., and Bennett, M.D. 2005. First nuclear DNA amounts in more than 300 511 angiosperms. Annals of botany 96(2): 229244.

512

20

Figure 1. Analysis of D. zingiberensis genome size by flow cytometry.

Figure 2. Distribution of 17-mer frequency for estimating the genome size of D. zingiberensis.

Figure 3. GC content and average sequencing depth of the genome data used for assembly (the x-axis shows the GC content, in percent, of each 10-kb non-overlapping window).

Figure 4. Percentage of different motifs in dinucleotide repeats in D. zingiberensis.

Figure 5. Percentage of different motifs in trinucleotide repeats in D. zingiberensis.

Figure 6. Gene Ontology classification. Genes were assigned to three categories: cellular components, molecular functions, and biological processes.

Supporting Information

1. Table S1. Occurrence of SSR motifs in the genome survey of D. zingiberensis. (XLS)

2. Figure S2. Gene assignment to KOG functional categories in D. zingiberensis. (TIF)

3. Table S3. Number of genes mapped onto KEGG pathways. (XLS)

Table 1. Statistics of sequencing data.

Library (bp)    Data (Gb)    Depth (X)    Q20 (%)    Q30 (%)
220             31.48        78.7         95.08      91.1

Table 2. Statistics of the assembled genome sequences.

Contigs
    Number of sequences       334,288
    Total length (bases)      174,634,115
    N50 length (bases)        1,079
    N90 length (bases)        219
    Max length (bases)        19,831
    GC content (%)            39.83

Scaffolds
    Number of sequences       3,548,310
    Total length (bases)      177,618,332
    N50 length (bases)        1,955
    N90 length (bases)        1,110
    Max length (bases)        40,163
    A                         281,503,836
    T                         271,128,258
    G                         174,939,581
    C                         180,143,770
    N                         10,628,111
    Total (ATGC)              907,715,445
    G+C% (ATGC)               39.12
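The G+C percentage and the N50 values reported in Table 2 follow from simple arithmetic. The sketch below recomputes G+C% from the base counts in the table and illustrates the usual N50 definition on a toy list of lengths. The function names and the toy lengths are hypothetical and are not taken from the manuscript; the assembler's own N50 routine may differ in detail.

```python
# Base counts copied from Table 2.
base_counts = {"A": 281_503_836, "T": 271_128_258, "G": 174_939_581, "C": 180_143_770}


def gc_percent(counts):
    """G+C as a percentage of all unambiguous bases (A+T+G+C)."""
    total = sum(counts[base] for base in "ATGC")
    return 100.0 * (counts["G"] + counts["C"]) / total


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    half = sum(lengths) / 2.0
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")


total_atgc = sum(base_counts.values())
gc = round(gc_percent(base_counts), 2)
print(total_atgc, gc)

# Hypothetical sequence lengths, used only to illustrate the N50 definition.
toy_lengths = [10, 8, 6, 4, 2]
print(n50(toy_lengths))
```

With the counts from Table 2, the recomputed total and G+C% agree with the "Total (ATGC)" and "G+C% (ATGC)" rows.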

Table 3. Statistics of sequencing data after filtering out the chloroplast data.

Library (bp)    Data (Gb)    Depth (X)    Q20 (%)    Q30 (%)
220             27.24        68.10        94.92      90.87
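The Depth column in Tables 1 and 3 can be related to the Data column by simple arithmetic. The sketch below assumes that depth is the data volume divided by the genome size, so that each row implies a genome size of data / depth. This relation is an assumption made for illustration, and the function and variable names are hypothetical.

```python
# (data in Gb, depth in X), copied from Tables 1 and 3.
rows = {
    "Table 1 (all data)": (31.48, 78.7),
    "Table 3 (chloroplast data removed)": (27.24, 68.10),
}


def implied_genome_size_mb(data_gb, depth_x):
    """Genome size in Mb implied by depth = data / genome size (assumed relation)."""
    return data_gb * 1000.0 / depth_x


implied = {name: implied_genome_size_mb(d, x) for name, (d, x) in rows.items()}
for name, size_mb in implied.items():
    print(f"{name}: about {size_mb:.0f} Mb")
```

Under this assumption both rows imply the same genome size of about 400 Mb.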

Table 4. Simple sequence repeat types detected in the D. zingiberensis sequences.

Searching item                                       Number         Ratio
Total number of sequences examined                   3,548,310
Total size of examined sequences (bp)                918,343,556
Total number of identified SSRs                      419,372        100%
Number of SSR-containing sequences                   353,988        84.41%
Number of sequences containing more than 1 SSR       50,592         12.06%
Number of SSRs present in compound formation         33,439         7.97%
Mononucleotide                                       228,973        54.60%
Dinucleotide                                         124,133        29.60%
Trinucleotide                                        47,681         11.37%
Tetranucleotide                                      14,815         3.53%
Pentanucleotide                                      2,726          0.65%
Hexanucleotide                                       1,044          0.25%
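The Ratio column in Table 4 can be reproduced by dividing each count by the total number of identified SSRs. The sketch below does this for the six motif classes; the dictionary keys and variable names are hypothetical, and the choice of denominator is inferred from the table values rather than stated in the manuscript.

```python
# Counts per motif class, copied from Table 4.
ssr_counts = {
    "mono": 228_973,
    "di": 124_133,
    "tri": 47_681,
    "tetra": 14_815,
    "penta": 2_726,
    "hexa": 1_044,
}
total_ssrs = 419_372  # "Total number of identified SSRs" in Table 4

# Percentage of all identified SSRs in each motif class, rounded as in the table.
ratios = {motif: round(100.0 * n / total_ssrs, 2) for motif, n in ssr_counts.items()}

for motif, pct in ratios.items():
    print(f"{motif:>5}: {pct:.2f}%")
print("sum of classes:", sum(ssr_counts.values()))
```

The six class counts add up to the reported total, and each recomputed percentage matches the Ratio column.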

Table 5. Statistics on gene information.

Software    Gene number    Gene length (bp)    Average gene length (bp)    Exon length (bp)    Average exon length (bp)    Intron length (bp)    Average intron length (bp)
EVM         27,057         24,659,071          911.37                      14,722,744          544.14                      6,388,807             236.12
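The average columns in Table 5 are consistent with dividing each total length by the gene number. The sketch below checks this, assuming that the exon and intron totals, which are split across lines in the table, read 14,722,744 and 6,388,807. Both that reading and the per-gene interpretation of the averages are assumptions, and the variable names are hypothetical.

```python
gene_number = 27_057  # EVM gene count from Table 5

# Total lengths in bp; the exon and intron totals are an assumed reading
# of digits that are split across lines in the table.
totals_bp = {
    "gene": 24_659_071,
    "exon": 14_722_744,
    "intron": 6_388_807,
}

# Average length per gene for each feature type, rounded as in the table.
averages = {feature: round(total / gene_number, 2) for feature, total in totals_bp.items()}

for feature, avg in averages.items():
    print(f"average {feature} length per gene: {avg:.2f} bp")
```

Under these assumptions the recomputed averages match the three average columns in Table 5.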

Table 6. Statistics of gene functional annotation.

Annotation database    Annotated number    Percentage (%)
GO                     12,736              47.07
KOG                    13,432              49.64
KEGG                   8,395               31.03
NR                     22,306              82.44
TrEMBL                 22,024              81.40
All Annotated          2,915               86.78

Table 7. Genome sizes of species in the genus Dioscorea.

Species Genome size (Mb) Original reference

Dioscorea dumetorum 831.3 (Obidiegwu et al. 2009)

Dioscorea tokoro 430.3 (Veselý et al. 2012)

Dioscorea togoensis 469.4 (Hamon et al. 2011)

Dioscorea alata 567.2 (Arumuganathan and Earle 1991)

Dioscorea abyssinica 616.1 (Hamon et al. 2011)

Dioscorea mangenotiana 616.1 (Hamon et al. 2011)

Dioscorea praehensilis 616.1 (Hamon et al. 2011)

Dioscorea rotundata 694.3 (Obidiegwu et al. 2009)

Dioscorea cayenensis 753.1 (Obidiegwu et al. 2009)

Dioscorea sylvatica 831.3 (Bharathan et al. 1994)

Dioscorea esculenta 1026.9 (Obidiegwu et al. 2009)

Dioscorea bulbifera 1173.6 (Obidiegwu et al. 2009)

Dioscorea villosa 2347.2 (Bharathan et al. 1994)

Dioscorea elephantipes 6601.5 (Zonneveld et al. 2005)

Figure 1. Analysis of D. zingiberensis genome size by flow cytometry.

Figure 2. Distribution of 17-mer frequency for estimating the genome size of D. zingiberensis.

Figure 3. GC content and average sequencing depth of the genome data used for assembly (the x-axis shows the GC content, in percent, of each 10-kb non-overlapping window).

Figure 4. Percentage of different motifs in dinucleotide repeats in D. zingiberensis.

Figure 5. Percentage of different motifs in trinucleotide repeats in D. zingiberensis.

Figure 6. Gene Ontology classification. Genes were assigned to three categories: cellular components, molecular functions, and biological processes.
