Genome
Genome Survey Sequencing of Dioscorea zingiberensis
Journal: Genome
Manuscript ID gen-2018-0011.R2
Manuscript Type: Article
Date Submitted by the Author: 22-May-2018
Complete List of Authors:
Zhou, Wen; Shaanxi Normal University, Life Science School
Li, Bin; Shaanxi Normal University, Life Science School
Li, Lin; Shaanxi Normal University, Life Science School
Ma, Wen; Shaanxi Normal University, Life Science School
Liu, Yuanchu; Shaanxi Normal University, Life Science School
Feng, Shuchao; Shaanxi Normal University, Life Science School
Wang, Zhezhi; Shaanxi Normal University, Life Science School
Keyword: Dioscorea zingiberensis, Genome survey sequencing, Genome analysis
Is the invited manuscript for consideration in a Special Issue?: N/A
https://mc06.manuscriptcentral.com/genome-pubs Page 1 of 34 Genome
Genome survey sequencing of Dioscorea zingiberensis

Wen Zhou+; Bin Li+; Lin Li; Wen Ma; Yuanchu Liu; Shuchao Feng; Zhezhi Wang*

1 Key Laboratory of the Ministry of Education for Medicinal Resources and Natural Pharmaceutical Chemistry, Shaanxi Normal University, Xi'an, Shaanxi 710119, P. R. China
2 National Engineering Laboratory for Resource Development of Endangered Chinese Crude Drugs in Northwest China, Shaanxi Normal University, Xi'an, Shaanxi 710119, P. R. China

+ These authors contributed equally to this work.

*Correspondence: Prof. Zhe Zhi WANG; [email protected]; Tel.: +86 29 85310260
Abstract
Dioscorea zingiberensis (Dioscoreaceae) is the main plant source of diosgenin (a steroidal sapogenin), the precursor for the production of steroid hormones in the pharmaceutical industry. Despite its large economic value, genomic information for the genus Dioscorea is currently unavailable. Here, we present an initial survey of the D. zingiberensis genome performed by next-generation sequencing together with a genome size estimate obtained by flow cytometry. The whole-genome survey of D. zingiberensis generated 31.48 Gb of sequence data with approximately 78.70× coverage. The estimated genome size is 800 Mb, with a high level of heterozygosity based on k-mer analysis. The reads were assembled into 334,288 contigs with an N50 length of 1,079 bp, which were further assembled into 92,163 scaffolds with a total length of 173.46 Mb. A total of 4,935 genes, 81 tRNAs, 69 rRNAs, and 661 miRNAs were predicted by the genome analysis, and 263,484 repeated sequences were obtained, with 419,372 simple sequence repeats (SSRs). Among these SSRs, the mononucleotide repeat type was the most abundant (54.60% of the total SSRs), followed by the dinucleotide (29.60%), trinucleotide (11.37%), tetranucleotide (3.53%), pentanucleotide (0.65%), and hexanucleotide (0.25%) repeat types. The 1C value of D. zingiberensis was calibrated against Salvia miltiorrhiza and calculated as 0.87 pg (851 Mb) by flow cytometry, which was very close to the result of the genome survey. This is the first report of genome-wide characterization within this taxon.
Key Words: Dioscorea zingiberensis, Genome survey sequencing, Genome analysis
Introduction
D. zingiberensis is an important and widely used medicinal herb in Traditional Chinese Medicine (TCM). It has been applied in the treatment of various diseases, such as cough, anthrax, rheumatoid arthritis, and sprains, as well as cardiac diseases (Li et al. 2010a; Qin et al. 2009). Diosgenin, a steroidal sapogenin that is abundant in the rhizomes of D. zingiberensis, is an important steroidal precursor in the pharmaceutical industry, where it is widely used as the starting material for the synthesis of many steroidal drugs (e.g., antioxidants, anti-inflammatories, androgen, oestrogen, and contraceptives) owing to the similarity of their skeletons (Bertrand et al. 2009; Wang et al. 2007). More importantly, steroidal sapogenins are attractive to many synthetic and medicinal chemists aiming to harness their anticancer activity (Minato et al. 2013). As global market demand for steroid hormones, such as sex hormones, cortical hormones, and protein anabolic hormones, increases by 8% annually, a matching supply of the precursor is required (Bai et al. 2015).
At present, the extraction of diosgenin from D. zingiberensis usually generates large volumes of highly acidic, high-strength wastewater, which poses a serious threat to the environment. In view of this, microbial bioengineering is an effective alternative method for producing diosgenin. However, genetic studies on D. zingiberensis remain underdeveloped compared with those on many other medicinal species, such as Salvia miltiorrhiza (Wenping et al. 2011), Dendrobium officinale (Liang et al. 2015), and Ganoderma lucidum (Chen et al. 2012), which might be due to the insufficient genetic and genomic resources available for D. zingiberensis.
In recent years, great advances in genome survey sequencing technology and bioinformatics have opened a new avenue for characterizing the genetic background of organisms, e.g., Myrica rubra (Jiao et al. 2012), Gracilariopsis lemaneiformis (Zhou et al. 2013), Fagopyrum tartaricum (Hou et al. 2016), and others. Compared with conventional methods for gene cloning and sequencing, next-generation sequencing (NGS) technology affords a quick, easy, and full-scale method of investigation. To provide a genomic resource for further research on this species (e.g., structural and functional genomics studies, molecular cloning, and comparative and evolutionary studies), we conducted a genome survey of D. zingiberensis using NGS technology. This study could pave the way for accelerating gene discovery and for better utilization of the existing genomic information in the future.
Materials and methods
Plant materials
D. zingiberensis was collected from Xunyang County, Shaanxi Province, China. Voucher specimens were prepared and identified by Prof. Tian Xianhua (College of Life Sciences, Shaanxi Normal University, Xi’an, P. R. China) and then deposited at the Key Laboratory of the Ministry of Education for Medicinal Resources and Natural Pharmaceutical Chemistry, Shaanxi Normal University. Young leaves were collected, frozen in liquid nitrogen, and stored at –80°C prior to genomic DNA extraction using the Plant Genomic DNA Kit (Tiangen Biotech, Beijing, China) following the manufacturer’s instructions. The extracts were electrophoresed on a 1% agarose gel to confirm the DNA quality and quantity. The concentrations of nucleic acids and proteins were measured spectrophotometrically at 260 nm on a BioPhotometer (Eppendorf, Germany).
Genome size estimation by flow cytometry
Salvia miltiorrhiza (1C = 0.66 pg DNA; Zhang et al. 2015) served as an internal reference standard. One to two young leaves per plant, equivalent to 300–500 mg, were excised and placed into a 100 mm Petri dish. To this, 1.5 mL of LB01 buffer (Doležel et al. 1989) was added, and the two types of tissue were chopped simultaneously with a razor for 30 s (~60 chops per sample) to release the nuclei. The resulting homogenate was filtered through a 48 µm nylon filter into a 1.5 mL tube. The nuclear suspension was then stained with 10 µL of propidium iodide (PI; 10 mg/mL), and 10 µL of RNase A (10 mg/mL) was added immediately to prevent the staining of double-stranded RNA. The samples were incubated on ice for 10 minutes. The aqueous suspension of intact nuclei from the samples and the internal reference DNA standard was then analysed on a NovoCyte flow cytometer (ACEA Biosciences, Inc.) with NovoExpress software (Version 1.2.4.1602). A blue argon laser at a wavelength of 488 nm was used as the light source, and at least 10,000 nuclei were measured per sample.
Genome sequencing and sequence assembly
Two paired-end libraries with an insert size of 220 base pairs (bp) were constructed from randomly fragmented genomic DNA following the manufacturer’s instructions (Illumina, Beijing, China). Sequence data were generated by Beijing Biomarker Technologies Co., Ltd. (Beijing, China) using an Illumina HiSeq 2500 sequencing platform. Short tips and low-quality sequences in the raw genome survey sequence data were filtered out to obtain high-quality reads, which were subsequently used for assembly with SOAPdenovo software (Li et al. 2010b). All sequencing reads were deposited in the Short Read Archive (SRA) database (http://www.ncbi.nlm.nih.gov/sra/) and are retrievable under the accession number SRX3235157.
Genome size estimation by k-mer analysis
In shotgun genome sequencing, short reads are assumed to be randomly generated, so any k-mers in the reads also occur randomly. Their depth of coverage follows the Poisson distribution (Li and Waterman 2003), and the mean k-mer depth should be equal to the peak value of the k-mer depth distribution. Two paired-end libraries with insert sizes of approximately 220 bp and 500 bp were sequenced on one lane of the Illumina HiSeq 2500 system in paired-end 150 bp mode. The high-quality Illumina sequences generated from these two genomic libraries were used for k-mer counting with SOAPec (v2.01) from the SOAPdenovo software package. Based on the k-mer analysis, the peak depth and the total number of 17-mers were obtained. The genome size and heterozygosity can then be estimated, with the genome size given by the following formula (Varshney et al. 2012): Genome size = K-mer number / Peak depth.
Guanine plus cytosine (GC) content analysis
The GC content is an important attribute of the genomes of plants and other living organisms. It is strictly controlled and moderately balanced across the genome (Parker et al. 2008). K-mer sizes of 20, 37, 55, 63, 71, 77, 83, and 95 were examined using default parameters, and the optimal k-mer size was selected based on the N50 length. The usable reads > 200 bases in length were selected for re-alignment to the contig sequences because sequences < 200 bp were likely to be derived from repetitive or low-quality sequences (Lu et al. 2016). Finally, the GC content and average sequencing depth were calculated in 10 kb non-overlapping sliding windows along the assembled sequence.
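The windowed GC calculation described above can be sketched as follows. This is a minimal illustration only: the function name and the toy sequence are hypothetical, and the sequencing depth per window, which would require read alignments, is not covered.

```python
def gc_content_windows(sequence, window=10_000):
    """Yield (start, gc_percent) for consecutive non-overlapping windows.

    Ambiguous bases (e.g. N) are excluded from the denominator; windows
    containing no A/C/G/T at all are skipped.
    """
    seq = sequence.upper()
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        at = chunk.count("A") + chunk.count("T")
        if gc + at:
            yield start, 100.0 * gc / (gc + at)


# Hypothetical 60-bp sequence, scanned with 20-bp windows for readability.
toy = "GGCC" * 5 + "ATAT" * 5 + "GCAT" * 5
for start, gc in gc_content_windows(toy, window=20):
    print(start, gc)
```

With the default `window=10_000`, the same function would scan a scaffold in the 10 kb steps mentioned in the text.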
SSR identification
The Perl script MIcroSAtellite (MISA; http://pgrc.ipk-gatersleben.de/misa/misa.html) was used with default parameters to identify microsatellites in the D. zingiberensis genome sequences (Thiel et al. 2003). Six types of SSR were considered: mono-, di-, tri-, tetra-, penta-, and hexanucleotide SSRs.
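To make the six SSR classes concrete, the following sketch finds perfect repeats with a regular expression. It is not the MISA script itself: the function name is hypothetical, and the minimum repeat counts (10, 6, 5, 5, 5, and 5 for mono- to hexanucleotide motifs) are assumed here for illustration, since the text states only that default parameters were used.

```python
import re

# Minimum number of repeats per motif length (assumed thresholds, see lead-in).
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(sequence):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = sequence.upper()
    hits = []
    for size, min_rep in MIN_REPEATS.items():
        # A motif of `size` bases followed by at least min_rep - 1 copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are themselves repeats of a shorter unit
            # (e.g. "AA" or "ATAT"), which belong to a smaller SSR class.
            if any(motif == motif[:d] * (size // d)
                   for d in range(1, size) if size % d == 0):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // size))
    return sorted(hits)


# Hypothetical test sequence containing one mono-, one di-, and one trinucleotide SSR.
toy = "GC" + "A" * 12 + "CG" + "AT" * 7 + "GG" + "CTT" * 5 + "AC"
print(find_ssrs(toy))
```

Counting the returned tuples by motif length gives the kind of per-class tally reported for the six SSR types.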
Gene prediction and annotation
The raw survey data and transcriptome data (Hua et al. 2017) were used to predict and annotate genes. After filtering out scaffolds < 1000 bp in size, GENSCAN, a program that identifies complete exon/intron structures of genes in genomic DNA, was applied for gene identification with parameters trained on D. zingiberensis (Burge and Karlin 1997). Additionally, TransDecoder v2.0 (Haas et al. 2013) and GeneMarkS-T v5.1 (Tang et al. 2015) were used to predict genes from the transcriptome data. Each predicted gene was annotated by BLAST alignment to the GenBank database and then compared against common databases such as Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), Eukaryotic Clusters of Orthologous Groups (KOG), Nr, TrEMBL, and Clusters of Orthologous Groups (COG). The annotated genes were then classified into GO categories and mapped onto the KEGG reference pathways (Hirakawa et al. 2015).
Results
Genome size estimation by flow cytometry
The flow cytometric analyses yielded a high-resolution histogram with CVs and mean values for the tetraploid D. zingiberensis and the internal standard Salvia miltiorrhiza (Fig. 1). The CVs were 0.80% and 2.01% for D. zingiberensis and Salvia miltiorrhiza, respectively. The peak ratio was calculated as 1.32, meaning that the 1C value of D. zingiberensis was 0.87 pg (1.32 × 0.66 pg = 0.87 pg).
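The arithmetic behind this estimate, and its conversion to the 851 Mb quoted in the abstract, can be sketched as follows. The conversion factor of 978 Mb per pg of DNA is a commonly used value and is an assumption here, as the manuscript does not state which factor was applied; the function names are illustrative.

```python
# Assumed conversion factor: 1 pg of DNA is taken as 978 Mb (not stated in the text).
MB_PER_PG = 978.0


def one_c_value_pg(peak_ratio, standard_1c_pg):
    """1C value of the sample = (sample peak / standard peak) x 1C of the standard."""
    return peak_ratio * standard_1c_pg


def pg_to_mb(pg):
    """Convert a DNA mass in picograms to megabase pairs."""
    return pg * MB_PER_PG


# Peak ratio (1.32) and 1C value of Salvia miltiorrhiza (0.66 pg) as given in the text.
sample_pg = one_c_value_pg(1.32, 0.66)
sample_mb = pg_to_mb(round(sample_pg, 2))
print(f"{sample_pg:.2f} pg, {sample_mb:.0f} Mb")
```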
Genome sequencing and sequence assembly
A total of 31.48 Gb of sequence data were generated from the small-insert (220 bp) library, corresponding to approximately 78.70× coverage, with 91.1% Q30 bases (base quality > 30), which met the requirement for successful assembly (Table 1). A large contig N50 together with a small number of contigs generally reflects a more continuous and complete assembly (Li et al. 2012). The 31.48 Gb of clean reads were used for de novo assembly. K-mer sizes of 20, 37, 55, 63, 71, 77, 83, and 95 were examined using default parameters. The SOAPdenovo assembly with a k-mer size of 63 was selected because it gave the best N50 (Table 2). N50 is a weighted median: it is the length of the shortest contig in the set of longest contigs whose combined length makes up 50% of the genome assembly (Carneiro et al. 2012; Earl et al. 2011). With a k-mer size of 63, SOAPdenovo produced contigs with an N50 of ~1.08 kb, a longest contig of ~19.83 kb, and a total length of ~170.54 Mb (Table 2). Scaffolds with an N50 of ~1.96 kb, a total length of ~173.46 Mb, and a longest scaffold of ~40.16 kb (Table 2) were also generated. The total gap length (Ns) was ~2.91 Mb.
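As a small illustration of the N50 statistic used to choose the k-mer size, the sketch below computes it from a list of contig lengths. The function name and the toy lengths are hypothetical and are not taken from the assembly.

```python
def n50(lengths):
    """Return the N50 of a set of sequence lengths.

    Sequences are sorted from longest to shortest; the N50 is the length of
    the sequence at which the running total first reaches half of the sum.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no sequence lengths given")


# Hypothetical contig lengths: total 300, so the half-way point is 150.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 80 + 70 reaches 150, so the N50 is 70
```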
Genome size estimation
For the k-mer analysis, a total of 27.24 Gb of clean data (Table 3), obtained after filtering out chloroplast sequences, were used to count and plot the 17-mer frequency distribution and thereby estimate the genome size of D. zingiberensis. In the 17-mer frequency distribution (Figure 1), the main peak, corresponding to the average k-mer depth, was at ~57×. The repeat peak was at an integer multiple of the main peak depth (~114×). The heterozygous peak appeared at half the depth of the main peak (~28×), and a minor peak clearly appeared at a quarter of the depth of the main peak (~15×). The sample was therefore suspected to be autotetraploid. Both diploids and tetraploids of this species occur in nature (Huang et al. 2010), and we deduced the sample to be autotetraploid. As a result, we estimated the genome size to be 800.00 Mb, calculated using the following formula: Genome size = K-mer number / Peak depth. Repetitive sequences accounted for approximately 42.81% of the D. zingiberensis genome, corresponding to an estimated 342.48 Mb. The heterozygosity rate was approximately 1.37%, indicating a complex genome with high heterozygosity.
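The estimation formula can be illustrated with a toy example. The sketch below counts k-mers in a handful of artificial reads, takes the most frequent depth as the peak, and divides the total k-mer number by it; it then repeats the repeat-content arithmetic from the text. The function names, the toy reads, and the `min_depth` error cutoff are assumptions made for illustration and do not reproduce the SOAPec workflow.

```python
from collections import Counter


def kmer_depth_histogram(reads, k):
    """Map each k-mer depth to the number of distinct k-mers seen at that depth."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())


def estimate_genome_size(histogram, min_depth=2):
    """Genome size = total k-mer number / peak depth.

    Depths below min_depth are ignored as likely sequencing errors (assumed cutoff).
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    peak_depth = max(usable, key=usable.get)
    kmer_num = sum(d * n for d, n in usable.items())
    return kmer_num / peak_depth, peak_depth


# Toy data: a 16-bp "genome" sequenced five times without errors, k = 4.
toy_genome = "ACGTTGCAAGGCTTAC"
toy_reads = [toy_genome] * 5
size, peak = estimate_genome_size(kmer_depth_histogram(toy_reads, k=4))
print(size, peak)  # 13 k-mer positions (16 - 4 + 1) at a depth of 5

# Repeat content reported in the text: 42.81% of an 800.00 Mb genome.
repeat_mb = 800.00 * 42.81 / 100
print(round(repeat_mb, 2))
```

For a real genome the number of k-mer positions differs from the genome length only by k - 1 per sequence, which is negligible at this scale.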
Guanine plus cytosine (GC) content analysis
To measure genome-wide sequencing bias, the GC content and average sequencing depth were plotted using non-overlapping 10 kb sliding windows along the assembled sequence (Figure 2). The GC content of the genome varies among plant species. A GC content that is too high (>65%) or too low (<25%) may cause sequence bias on the Illumina sequencing platform, thus seriously affecting genome assembly (Aird et al. 2011). The average GC content of the D. zingiberensis genome was 39.12% (Table 2), which was higher than that of Arabidopsis thaliana (36%) (Barakat et al. 1998) and potato (34.8–36.0%) (Consortium et al. 2011; Hirakawa et al. 2015) but lower than that of some marine macroalgae, such as Cyanidioschyzon merolae (55.0%) (Ohta et al. 2003), Solieria filiformis (48.6%) (Dalmon and Loiseaux 1981), Chondrus crispus (46.3%) (Gall et al. 1993), and Laminaria hyperborea (42.6%) (Stam et al. 1988). Therefore, the D. zingiberensis genome has a moderate GC content. Moreover, the GC depth plot was slightly stratified into four layers (Figure 2), which was caused in part by polyploidy and the high heterozygosity rate (1.37%).
SSR identification
A total of 3,548,310 sequences from the genome survey were examined, and 419,372 SSRs were identified (Table 4). Mononucleotide repeats were the predominant type, accounting for 54.60% of the observed SSRs, followed by the di- (29.60%), tri- (11.37%), tetra- (3.53%), penta- (0.65%), and hexanucleotide (0.25%) repeat types (Table 4). Mononucleotide repeats have been reported to be the most common type of repeat both in monocot species, such as rice, sorghum, and Brachypodium, and in dicot species, such as Arabidopsis, Medicago, and Populus, accounting for up to 79% in Medicago (Sonah et al. 2011).
In addition, 354 motif types were identified, comprising mono- (4), di- (8), tri- (30), tetra- (86), penta- (130), and hexanucleotide (96) types (Table S1). Among the dinucleotide repeat motifs, the TA/TA and AT/AT repeats were the two most abundant types, accounting for 50.44% and 48.77%, respectively, followed by CT/AG repeats (19.21%) (Figure 3). The predominant trinucleotide motifs were ATT/AAT and TAA/TTA, accounting for 22.25% and 13.92%, respectively (Figure 4).
Gene prediction and annotation
Based on the combination of genome sequencing and transcriptome data analysis of D. zingiberensis, a total of 27,057 genes were predicted by GENSCAN and EVM (Altschul et al. 1990) (Table 5). The average length of the putative genes was 911 bp, and the average exon and intron lengths were 544 bp and 236 bp, respectively. The putative genes were aligned by BLAST to the NR (Marchler-Bauer et al. 2011), KOG (Tatusov et al. 2001), GO (Dimmer et al. 2012), KEGG (Kanehisa and Goto 2000), and TrEMBL (Boeckmann et al. 2003) databases, and 86.78% of the putative genes were matched (Table 6).
GO annotations for the putative genes were obtained using the Blast2GO program (Conesa et al. 2005). WEGO software (Ye et al. 2006) was then applied to perform GO functional classification of all genes and to understand the distribution of gene functions in D. zingiberensis at the macro level. A total of 12,736 genes were identified by the GO slim analysis and further classified into the categories of molecular function, cellular component, and biological process (Figure 5). Specifically, 35.79%, 19.82%, and 44.39% of the genes were grouped under cellular component, molecular function, and biological process, respectively. Furthermore, cell and cell part (24.37% and 24.59%, respectively) were the most highly represented groups within cellular component; catalytic activity (47.18%) represented a relatively high proportion within molecular function; and metabolic process (20.66%) was the most highly represented group within biological process.
Altogether, 13,432 putative genes were classified into KOG functional categories. The largest group was general function prediction only (3,306; 24.61%), followed by signal transduction mechanisms (1,567; 11.67%) and posttranslational modification, protein turnover, and chaperones (1,269; 9.45%) (Figure S1).
Pathway assignments were made according to KEGG mapping (Kanehisa and Goto 2000). In total, 7,406 putative genes were assigned to 127 KEGG pathways (Table S3). Of these, 4,473 genes (60.40%) were associated with 96 metabolic pathways, of which 1,174 (26.25% of the metabolic genes) were involved in carbohydrate metabolism. The next largest category was genetic information processing (1,945; 26.26%), followed by environmental information processing (382), cellular processes (350), and organismal systems (251).
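Because the KEGG percentages above are taken relative to different totals, a short check of the arithmetic may help the reader. The gene counts are those reported in the text; the helper name is hypothetical.

```python
def percent(part, whole):
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


kegg_total = 7406   # putative genes assigned to KEGG pathways
metabolic = 4473    # genes associated with metabolic pathways

print(percent(metabolic, kegg_total))  # metabolic genes, relative to all KEGG-assigned genes
print(percent(1174, metabolic))        # carbohydrate metabolism, relative to metabolic genes
print(percent(1945, kegg_total))       # genetic information processing, relative to all genes
print(percent(3306, 13432))            # largest KOG category, relative to all KOG-classified genes
```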
Discussion
Genome size, also known as the genomic content or DNA 1C value, refers to the DNA content of the gamete genome. Genome size is the basis for comparative and evolutionary genomics (Hirano and Das 2012): by comparing the genome sizes of different species, regularities in genome variation can be detected and understood. Flow cytometry has been regarded as a standard method for estimating genome size in plants and other organisms, such as Lessingianthus (Vernonieae, Asteraceae) (Angulo and Dematteis 2013), Capsicum (Solanaceae) (Moscone et al. 2003), and Bemisia tabaci (Aleyrodidae) (Brown et al. 2005). In addition, Feulgen spectrophotometry (Ha et al. 2007) and pulsed-field gel electrophoresis (PFGE) (Lingohr et al. 2009; Zhang and Fan 2002) have been proven to be effective methods for determining genome size. However, the development of NGS technologies has provided researchers with a more efficient and affordable approach to addressing a wide range of problems relating to non-model species. Such a method has been applied to the analysis of the genomes of Brassica juncea (Yang et al. 2014), Myrica rubra (Jiao et al. 2012), and Gracilariopsis lemaneiformis (Zhou et al. 2013).
Among the 200–250 species in the genus Dioscorea, D. zingiberensis is the most important in terms of its high content of dioscin, which has medicinal value. However, the limited genomic information has constrained genetic studies on D. zingiberensis. This article offers a brief description of the genome size of D. zingiberensis, providing a reference for genome size variation in the genus Dioscorea. This is the first report of genome-wide characterization in the genus Dioscorea, and it evaluates the heterozygosity rate, GC content, and GC distribution of the genome. The main conclusions of the study are as follows.
(1) The genome size, heterozygosity rate, and GC content of D. zingiberensis are approximately 800 Mb, 1.37%, and 39.12%, respectively, as estimated from the k-mer depth distribution of the sequenced reads. Compared with other species in Dioscorea, the genome of D. zingiberensis is relatively small (Table 7). Genome size clearly varies among species in this genus. Possible reasons include differing extents of repeat-sequence amplification in different species (Ohri and Khoshoo 1986; Wakamiya et al. 1993) and hybridization between closely related taxa (Hall et al. 2000). The most likely explanation is that frequent polyploidization events drove genome size variation during the evolution of the genus (Tian et al. 2008). Furthermore, the flow cytometry result is consistent with that of the genome survey.
Arabidopsis has been documented as a model organism for genetic study, mainly because it has a small genome (120 Mb) that is amenable to detailed molecular analysis (Meinke et al. 1998; Meyerowitz and Pruitt 1985). Among marine macroalgae, Pyropia has been regarded as a model organism for genetic studies, partly because its haploid phase has a relatively small genome (270–530 Mb) (Yang et al. 2011). The estimated genome size of D. zingiberensis (800 Mb) is much smaller than that of other Dioscorea species, which shows the potential of D. zingiberensis as a model species in this regard.
(2) The genome survey identified a total of 419,372 SSRs in the D. zingiberensis genome. SSRs in plant genomes have been surveyed in many species, and their numbers differ considerably among sequenced plants, such as Oryza sativa (70,531), Arabidopsis thaliana (15,249), and Sorghum bicolor (73,658) (Sonah et al. 2011). The number of SSRs in D. zingiberensis was nearly six times that in Oryza sativa. The frequency of each of the 354 SSR motif types is presented in Table S1. The TA/TA repeat was the most prominent type, accounting for 50.44%. The 419,372 SSR loci found in our study may be used to develop SSR markers for genetic mapping in the short term.
(3) The number of genes predicted for D. zingiberensis from the genome survey alone was much lower than that for other sequenced genomes such as Rosa roxburghii Tratt. (Lu et al. 2016), Prunus mume (Zhang et al. 2012), and Prunus persica (Verde et al. 2013). The likely reasons are insufficient sequencing depth and low sequence homology owing to the limited gene information available from closely related species (Zhou et al. 2013). However, after the transcriptome data were incorporated, the number of annotated genes increased significantly.
(4) Heterozygous genomes are relatively difficult to assemble using the whole-genome shotgun (WGS) strategy (Chen and Pachter 2005). The more laborious bacterial artificial chromosome (BAC) strategy can instead resolve problems associated with the assembly of a heterozygous genome (Wu et al. 2013). Nevertheless, to ensure the integrity of the assembly, homozygous materials have always been preferred for genome sequencing (Shulaev et al. 2011; Xu et al. 2013).
Author Contributions
Conceived and designed the experiments: WZ BL. Performed the experiments: YCL WM. Analysed the data: WZ LL. Contributed reagents/materials/analysis tools: SCF YCL. Wrote the paper: WZ BL.

Data Availability Statement: The genome sequence reads obtained by Illumina HiSeq 2500 are available at NCBI SRA. The BioProject accession number is PRJNA391240 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA391240), and the BioSample accession number is SAMN07259770 (https://www.ncbi.nlm.nih.gov/biosample/SAMN07259770). The Experiment number is SRX3235157 (Dioscorea zingiberensis) and the Run number is SRR6122503.

Funding: This work was supported by the Fundamental Research Funds for the Central Universities (2017CSZ008) and the National Natural Science Foundation of China (31670299).
References
Aird, D., Ross, M.G., Chen, W.S., Danielsson, M., Fennell, T., Russ, C., Jaffe, D.B., Nusbaum, C., and Gnirke, A. 2011. Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biology 12(2): 1–14.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. Journal of Molecular Biology 215(3): 403–410.
Angulo, M.B., and Dematteis, M. 2013. Nuclear DNA content in some species of Lessingianthus (Vernonieae, Asteraceae) by flow cytometry. J Plant Res 126(4): 461–468. doi: 10.1007/s10265-012-0539-x.
Arumuganathan, K., and Earle, E.D. 1991. Nuclear DNA content of some important plant species. Plant Molecular Biology Reporter 9(3): 208–218.
Bai, Y., Zhang, L., Jin, W., Wei, M., Zhou, P., Zheng, G., Niu, L., Nie, L., Zhang, Y., and Wang, H. 2015. In situ high-valued utilization and transformation of sugars from Dioscorea zingiberensis C.H. Wright for clean production of diosgenin. Bioresource Technology 196: 642.
Barakat, A., Matassi, G., and Bernardi, G. 1998. Distribution of Genes in the Genome of Arabidopsis thaliana and Its Implications for the Genome Organization of Plants. Proceedings of the National Academy of Sciences of the United States of America 95(17): 10044–10049.
Bertrand, J., Liagre, B., Bégaud-Grimaud, G., Jauberteau, M.O., Beneytout, J.L., Cardot, P.J.P., and Battu, S. 2009. Analysis of relationship between cell cycle stage and apoptosis induction in K562 cells by sedimentation field-flow fractionation. Journal of Chromatography B Analytical Technologies in the Biomedical & Life Sciences 877(11–12): 1155.
Bharathan, G., Lambert, G., and Galbraith, D.W. 1994. Nuclear DNA Content of Monocotyledons and Related Taxa. American Journal of Botany 81(3): 381–386.
Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.C., Estreicher, A., Gasteiger, E., Martin, M.J., Michoud, K., O'Donovan, C., Phan, I., Pilbout, S., and Schneider, M. 2003. The Swiss-Prot knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Research 31(1): 365.
Brown, J.K., Lambert, G.M., Ghanim, M., Czosnek, H., and Galbraith, D.W. 2005. Nuclear DNA content of the whitefly Bemisia tabaci (Aleyrodidae: Hemiptera) estimated by flow cytometry. Bulletin of Entomological Research 95(4): 309–312.
Burge, C., and Karlin, S. 1997. Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology 268(1): 78–94.
Carneiro, A.R., Ramos, R.T., Barbosa, H.P., Schneider, M.P., Barh, D., Azevedo, V., and Silva, A. 2012. Quality of prokaryote genome assembly: Indispensable issues of factors affecting prokaryote genome assembly quality. Gene 505(2): 365.
Chen, K., and Pachter, L. 2005. Bioinformatics for Whole-Genome Shotgun Sequencing of Microbial Communities. PLoS Comput. Biol. 1(2): 106.
Chen, S., Xu, J., Liu, C., Zhu, Y., Nelson, D.R., Zhou, S., Li, C., Wang, L., Guo, X., and Sun, Y. 2012. Genome sequence of the model medicinal mushroom Ganoderma lucidum. Nature Communications 3(2): 913.
Conesa, A., Götz, S., García-Gómez, J.M., Terol, J., Talón, M., and Robles, M. 2005. Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics 21(18): 3674–3676.
Consortium, P.G.S., Xu, X., Pan, S., Cheng, S., Zhang, B., Mu, D., Ni, P., Zhang, G., Yang, S., and Li, R. 2011. Genome sequence and analysis of tuber crop potato. Nature 475(7355): 189–195.
Dalmon, J., and Loiseaux, S. 1981. The deoxyribonucleic acids of two brown algae: Pylaiella littoralis (L.) Kjellm. and Sphacellaria sp. Plant Science Letters 21(3): 241–251.
Dimmer, E.C., Huntley, R.P., Alam-Faruque, Y., Sawford, T., O'Donovan, C., Martin, M.J., Bely, B., Browne, P., Chan, W.M., Eberhardt, R., Gardner, M., Laiho, K., Legge, D., Magrane, M., Pichler, K., Poggioli, D., Sehra, H., Auchincloss, A., Axelsen, K., Blatter, M.C., Boutet, E., Braconi-Quintaje, S., Breuza, L., Bridge, A., Coudert, E., Estreicher, A., Famiglietti, L., Ferro-Rojas, S., Feuermann, M., Gos, A., Gruaz-Gumowski, N., Hinz, U., Hulo, C., James, J., Jimenez, S., Jungo, F., Keller, G., Lemercier, P., Lieberherr, D., Masson, P., Moinat, M., Pedruzzi, I., Poux, S., Rivoire, C., Roechert, B., Schneider, M., Stutz, A., Sundaram, S., Tognolli, M., Bougueleret, L., Argoud-Puy, G., Cusin, I., Duek-Roggli, P., Xenarios, I., and Apweiler, R. 2012. The UniProt-GO Annotation database in 2011. Nucleic Acids Research 40(D1): D565–D570. doi: 10.1093/nar/gkr1048.
Doležel, J., Binarová, P., and Lucretti, S. 1989. Analysis of Nuclear DNA content in plant cells by Flow cytometry. Biologia Plantarum 31(2): 113–120.
Earl, D., Bradnam, K., St, J.J., Darling, A., Lin, D., Fass, J., Yu, H.O., Buffalo, V., Zerbino, D.R., and Diekhans, M. 2011. Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Research 21(12): 2224.
Gall, Y.L., Brown, S., Marie, D., Mejjad, M., and Kloareg, B. 1993. Quantification of nuclear DNA and G-C content in marine macroalgae by flow cytometry of isolated nuclei. Protoplasma 173(3): 123–132.
Ha, S.H., Kim, J.B., Park, J.S., Lee, S.W., and Cho, K.J. 2007. A comparison of the carotenoid accumulation in Capsicum varieties that show different ripening colours: deletion of the capsanthin-capsorubin synthase gene is not a prerequisite for the formation of a yellow pepper.
Haas, B.J., Papanicolaou, A., Yassour, M., Grabherr, M., Blood, P.D., Bowden, J., Couger, M.B., Eccles, D., Li, B., and Lieber, M. 2013. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nature Protocols 8(8): 1494–1512.
Hall, S.E., Dvorak, W.S., Johnston, J.S., Price, H.J., and Williams, C.G. 2000. Flow Cytometric Analysis of DNA Content for Tropical and Temperate New World Pines. Annals of Botany 86(6): 1081–1086.
Hamon, P., Brizard, J.P., Zoundjihékpon, J., Duperray, C., and Borgel, A. 2011. Étude des index d'ADN de huit espèces d'ignames (Dioscorea sp.) par cy. Canadian Journal of Botany 70(5): 996–1000.
Hirakawa, H., Okada, Y., Tabuchi, H., Shirasawa, K., Watanabe, A., Tsuruoka, H., Minami, C., Nakayama, S., Sasamoto, S., and Kohara, M. 2015. Survey of genome sequences in a wild sweet potato, Ipomoea trifida (H. B. K.) G. Don. DNA Research 22(2): 171–179.
Hirano, M., and Das, S. 2012. Editorial [Hot Topic: Comparative Genomics and Genome Evolution (Guest Editors: Sabyasachi Das and Masayuki Hirano)]. Current Genomics 13(2): .
Hou, S., Sun, Z., Linghu, B., Xu, D., Wu, B., Zhang, B., Wang, X., Han, Y., Zhang, L., and Qiao, Z. 2016. Genetic Diversity of Buckwheat Cultivars (Fagopyrum tartaricum Gaertn.) Assessed with SSR Markers Developed from Genome Survey Sequences. Plant Molecular Biology Reporter 34(1): 233–241.
Hua, W., Kong, W., Cao, X.Y., Chen, C., Liu, Q., Li, X., and Wang, Z. 2017. Transcriptome analysis of Dioscorea zingiberensis identifies genes involved in diosgenin biosynthesis. Genes & Genomics 39(5): 1–12.
Huang, H.P., Gao, S.L., Chen, L.L., and Wei, K.H. 2010. In vitro tetraploid induction and generation of tetraploids from mixoploids in Dioscorea zingiberensis. Pharmacognosy Magazine 6(21): 51–56.
Jiao, Y., Jia, H.M., Li, X.W., Chai, M.L., Jia, H.J., Chen, Z., Wang, G.Y., Chai, C.Y., Weg, E.V.D., and Gao, Z.S. 2012. Development of simple sequence repeat (SSR) markers from a genome survey of Chinese bayberry (Myrica rubra).
BMC Genomics 13(1): 201. 389 Kanehisa, M., and Goto, S. 2000. KEGG: Kyoto Encyclopaedia of Genes and Genomes. Nucleic Acids 390 Research volume 28(1): 27 30(24). 391 Li, D., Zhi, D., Bi, Q., Liu, X., Men, and Zhonghua. 2012. De novo assembly and characterization of bark 392 transcriptome using Illumina sequencing and development of EST SSR markers in rubber tree (Hevea 17
https://mc06.manuscriptcentral.com/genome-pubs Genome Page 18 of 34
393 brasiliensis Muell. Arg.). BMC Genomics 13(1): 192. 394 Li, H., Huang, W., Wen, Y., Gong, G., Zhao, Q., and Yu, G. 2010a. Anti thrombotic activity and chemical 395 characterization of steroidal saponins from Dioscorea zingiberensis C.H. Wright. Fitoterapia 81(8): 396 1147 1156. doi: 10.1016/j.fitote.2010.07.016. 397 Li, R., Fan, W., Tian, G., Zhu, H., He, L., Cai, J., Huang, Q., Cai, Q., Li, B., Bai, Y., Zhang, Z., Zhang, Y., 398 Wang, W., Li, J., Wei, F., Li, H., Jian, M., Li, J., Zhang, Z., Nielsen, R., Li, D., Gu, W., Yang, Z., Xuan, Z., 399 Ryder, O.A., Leung, F.C., Zhou, Y., Cao, J., Sun, X., Fu, Y., Fang, X., Guo, X., Wang, B., Hou, R., Shen, F., 400 Mu, B., Ni, P., Lin, R., Qian, W., Wang, G., Yu, C., Nie, W., Wang, J., Wu, Z., Liang, H., Min, J., Wu, Q., 401 Cheng, S., Ruan, J., Wang, M., Shi, Z., Wen, M., Liu, B., Ren, X., Zheng, H., Dong, D., Cook, K., Shan, G., 402 Zhang, H., Kosiol, C., Xie, X., Lu, Z., Zheng, H., Li, Y., Steiner, C.C., Lam, T.T., Lin, S., Zhang, Q., Li, G., 403 Tian, J., Gong, T., Liu, H., Zhang, D., Fang, L., Ye, C., Zhang, J., Hu, W., Xu, A., Ren, Y., Zhang, G., 404 Bruford, M.W., Li, Q., Ma, L., Guo, Y., An, N., Hu, Y., Zheng, Y., Shi, Y., Li, Z., Liu, Q., Chen, Y., Zhao, J., 405 Qu, N., Zhao, S., Tian, F., Wang, X., Wang, H., Xu, L., Liu, X., Vinar, T., Wang, Y., Lam, T.W., Yiu, S.M., 406 Liu, S., Zhang, H., Li, D., Huang, Y., Wang, X., Yang, G., Jiang, Z., Wang, J., Qin, N., Li, L., Li, J., Bolund, 407 L., Kristiansen, K., Wong, G.K., Olson, M., Zhang, X., Li, S., Yang, H., Wang, J., and Wang, J. 2010b. The 408 sequence and de novo assembly of the giant panda genome. Nature 463(7279): 311 317. doi: 409 10.1038/nature08696. 410 Li, X., and Waterman, M.S. 2003. Estimating the Repeat Structure and Length of DNA Sequences Using 411 ℓ Tuples. Genome Research 13(8): 1916. 412 Liang, Xiao, Wang, Yang, Tian, Jinmin, Lian, Ruijuan, Yang, and Shumei. 2015. 
The Genome of 413 Dendrobium officinale Illuminates the BiologyDraft of the Important Traditional Chinese Orchid Herb. 分子植 414 物(英文版) 8(6): 922 934. 415 Lingohr, E., Frost, S., and Johnson, R.P. 2009. Determination of Bacteriophage Genome Size by 416 Pulsed Field Gel Electrophoresis. Humana Press. 417 Lu, M., An, H., and Li, L. 2016. Genome Survey Sequencing for the Characterization of the Genetic 418 Background of Rosa roxburghii Tratt and Leaf Ascorbate Metabolism Genes. Plos One 11(2): e0147530. 419 Marchlerbauer, A., Lu, S., Anderson, J.B., Chitsaz, F., Derbyshire, M.K., Deweesescott, C., Fong, J.H., Geer, 420 L.Y., Geer, R.C., and Gonzales, N.R. 2011. CDD: a Conserved Domain Database for the functional 421 annotation of proteins. Nucleic Acids Research 39(Database issue): D225. 422 Meinke, D.W., Cherry, J.M., Dean, C., Rounsley, S.D., and Koornneef, M. 1998. Arabidopsis thaliana: a 423 model plant for genome analysis. Science 282(5389): 679 682. 424 Meyerowitz, E.M., and Pruitt, R.E. 1985. Arabidopsis thaliana and Plant Molecular Genetics. Science 425 229(4719): 1214 1218. doi: 10.1126/science.229.4719.1214. 426 Minato, D., Li, B., Zhou, D., Shigeta, Y., Toyooka, N., Sakurai, H., Sugimoto, K., Nemoto, H., and Matsuya, 427 Y. 2013. Synthesis and antitumor activity of des AB analogue of steroidal saponin OSW 1. Tetrahedron 428 69(37): 8019 8024. doi: 10.1016/j.tet.2013.06.105. 429 Moscone, E.A., Baranyi, M., Ebert, I., Greilhuber, J., Ehrendorfer, F., and Hunziker, A.T. 2003. Analysis of 430 Nuclear DNA Content in Capsicum (Solanaceae) by Flow Cytometry and Feulgen Densitometry. Annals of 431 Botany 92(1): 21. 432 Obidiegwu, J.E., Rodriguez, E., Ene Obong, E.E., Loureiro, J., Muoneke, C.O., Santos, C., 433 Kolesnikova Allen, M., and Asiedu, R. 2009. Estimation of the nuclear DNA content in some representative 434 of genus Dioscorea. Scientific Research & Essays 4(5): 448 452. 435 Ohri, D., and Khoshoo, T.N. 1986. Genome size in gymnosperms. 
Plant Systematics and Evolution 153(1): 436 119 132. 437 Ohta, N., Matsuzaki, M., Misumi, O., Miyagishima, S.Y., Nozaki, H., Kan, T., Shin I, T., Kohara, Y., and 18
https://mc06.manuscriptcentral.com/genome-pubs Page 19 of 34 Genome
438 Kuroiwa, T. 2003. Complete Sequence and Analysis of the Plastid Genome of the Unicellular Red Alga 439 Cyanidioschyzon merolae. DNA Research 10(2): 67. 440 Parker, S.C.J., Margulies, E.H., and Tullius, T.D. 2008. THE RELATIONSHIP BETWEEN FINE SCALE 441 DNA STRUCTURE, GC CONTENT, AND FUNCTIONAL ELEMENTS IN 1% OF THE HUMAN 442 GENOME. In. p. 199. 443 Qin, Y., Wu, X., Huang, W., Gong, G., Li, D., He, Y., and Zhao, Y. 2009. Acute toxicity and sub chronic 444 toxicity of steroidal saponins from Dioscorea zingiberensis C.H.Wright in rodents. Journal of 445 Ethnopharmacology 126(3): 543 550. 446 Shulaev, V., Sargent, D.J., Crowhurst, R.N., Mockler, T.C., Folkerts, O., Delcher, A.L., Jaiswal, P., Mockaitis, 447 K., Liston, A., Mane, S.P., Burns, P., Davis, T.M., Slovin, J.P., Bassil, N., Hellens, R.P., Evans, C., Harkins, 448 T., Kodira, C., Desany, B., Crasta, O.R., Jensen, R.V., Allan, A.C., Michael, T.P., Setubal, J.C., Celton, J.M., 449 Rees, D.J.G., Williams, K.P., Holt, S.H., Rojas, J.J.R., Chatterjee, M., Liu, B., Silva, H., Meisel, L., Adato, 450 A., Filichkin, S.A., Troggio, M., Viola, R., Ashman, T.L., Wang, H., Dharmawardhana, P., Elser, J., Raja, R., 451 Priest, H.D., Bryant, D.W., Fox, S.E., Givan, S.A., Wilhelm, L.J., Naithani, S., Christoffels, A., Salama, D.Y., 452 Carter, J., Girona, E.L., Zdepski, A., Wang, W.Q., Kerstetter, R.A., Schwab, W., Korban, S.S., Davik, J., 453 Monfort, A., Denoyes Rothan, B., Arus, P., Mittler, R., Flinn, B., Aharoni, A., Bennetzen, J.L., Salzberg, 454 S.L., Dickerman, A.W., Velasco, R., Borodovsky, M., Veilleux, R.E., and Folta, K.M. 2011. The genome of 455 woodland strawberry (Fragaria vesca). Nat Genet 43(2): 109 116. doi: 10.1038/ng.740. 456 Sonah, H., Deshmukh, R.K., Sharma, A., Singh, V.P., Gupta, D.K., Gacche, R.N., Rana, J.C., Singh, N.K., 457 and Sharma, T.R. 2011. Genome Wide Distribution and Organization of Microsatellites in Plants: An Insight 458 into Marker Development in Brachypodium.Draft Plos One 6(6). 
doi: ARTN e21298 459 10.1371/journal.pone.0021298. 460 Stam, W.T., Bot, P.V.M., Boele Bos, S.A., Rooij, J.M.V., and Hoek, C.V.D. 1988. Single copy DNA DNA 461 hybridizations among five species ofLaminaria (Phaeophyceae): Phylogenetic and biogeographic 462 implications. Helgoland Marine Research 42(2): 251 267. 463 Tang, S., Lomsadze, A., and Borodovsky, M. 2015. Identification of protein coding regions in RNA 464 transcripts. Nucleic Acids Research 43(12): e78. 465 Tatusov, R.L., Natale, D.A., Garkavtsev, I.V., Tatusova, T.A., Shankavaram, U.T., Rao, B.S., Kiryutin, B., 466 Galperin, M.Y., Fedorova, N.D., and Koonin, E.V. 2001. The COG database: new developments in 467 phylogenetic classification of proteins from complete genomes. Nucleic Acids Research 29(1): 22. 468 Thiel, T., Michalek, W., Varshney, R.K., and Graner, A. 2003. Exploiting EST databases for the development 469 and characterization of gene derived SSR markers in barley (Hordeum vulgare L.). Theoretical and Applied 470 Genetics 106(3): 411 422. 471 Tian, M., Ji Yuan, L.I., Sui, N.I., Fan, Z.Q., Xin Lei, and Li. 2008. Phylogenetic Study on Section Camellia 472 Based on ITS Sequences Data. Acta Horticulturae Sinica 35(11): 1685 1688. 473 Varshney, R.K., Chen, W., Li, Y., Bharti, A.K., Saxena, R.K., Schlueter, J.A., Donoghue, M.T., Azam, S., 474 Fan, G., and Whaley, A.M. 2012. Draft genome sequence of pigeonpea (Cajanus cajan), an orphan legume 475 crop of resource poor farmers. Nature Biotechnology 30(1): 83 89. 476 Verde, I., Abbott, A.G., Scalabrin, S., Jung, S., Shu, S., Marroni, F., Zhebentyayeva, T., Dettori, M.T., and 477 Grimwood, J. 2013. The high quality draft genome of peach (Prunus persica) identifies unique patterns of 478 genetic diversity, domestication and genome evolution. Nat Genet 45(5): 487 494. 479 Veselý, P., Bureš, P., Šmarda, P., and Pavlíček, T. 2012. Genome size and DNA base composition of 480 geophytes: the mirror of phenology and ecology? Annals of botany 109(1): 65. 
481 Wakamiya, I., Newton, R.J., Johnston, J.S., and Price, H.J. 1993. Genome Size and Environmental Factors in 482 the Genus Pinus. American Journal of Botany 80(11): 1235 1241. 19
https://mc06.manuscriptcentral.com/genome-pubs Genome Page 20 of 34
483 Wang, Y., Zhang, Y., Zhu, Z., Zhu, S., Li, Y., Li, M., and Yu, B. 2007. Exploration of the correlation between 484 the structure, hemolytic activity, and cytotoxicity of steroid saponins. Bioorganic & Medicinal Chemistry 485 15(7): 2528. 486 Wenping, H., Yuan, Z., Jie, S., Lijun, Z., and Zhezhi, W. 2011. De novo transcriptome sequencing in Salvia 487 miltiorrhiza to identify genes involved in the biosynthesis of active ingredients. Genomics 98(4): 272. 488 Wu, J., Wang, Z., Shi, Z., Zhang, S., Ming, R., Zhu, S., Khan, M.A., Tao, S., Korban, S.S., and Wang, H. 489 2013. The genome of the pear (Pyrus bretschneideri Rehd.). Genome Research 23(2): 396. 490 Xu, Q., Chen, L.L., Ruan, X., Chen, D., Zhu, A., Chen, C., Bertrand, D., Jiao, W.B., Hao, B.H., and Lyon, 491 M.P. 2013. The draft genome of sweet orange (Citrus sinensis). Nat Genet 45(1): 59. 492 YANG, Hui, and MAO. 2011. Profiling of the transcriptome of Porphyra yezoensis with Solexa sequencing 493 technology. Science Bulletin 56(20): 2119 2130. 494 Yang, J., Ning, S., Xuan, Z., Qi, X., Hu, Z., and Zhang, M. 2014. Genome survey sequencing provides clues 495 into glucosinolate biosynthesis and flowering pathway evolution in allotetrapolyploid Brassica juncea. BMC 496 Genomics 15(1): 107. 497 Ye, J., Fang, L., Zheng, H., Zhang, Y., Chen, J., Zhang, Z., Wang, J., Li, S., Li, R., and Bolund, L. 2006. 498 WEGO: a web tool for plotting GO annotations. Nucleic Acids Research 34(Web Server issue): 293 297. 499 Zhang, G., Yang, T., Jing, Z., Shu, L., Yang, S., Wen, W., Sheng, J., Yang, D., and Wei, C. 2015. Hybrid de 500 novo genome assembly of the Chinese herbal plant danshen ( Salvia miltiorrhiza Bunge). 501 GigaScience,4,1(2015 12 14) 4(1): 62. 502 Zhang, J.Z., and Fan, M.Y. 2002. Determination of genome size and restriction fragment length 503 polymorphism of four Chinese rickettsial isolatesDraft by pulsed field gel electrophoresis. Acta Virologica 46(1): 504 25 30. 
505 Zhang, Q., Chen, W., Sun, L., Zhao, F., Huang, B., Yang, W., Tao, Y., Wang, J., Yuan, Z., and Fan, G. 2012. 506 The genome ofPrunus mume. Nature Communications 3(4): 1318. 507 Zhou, W., Hu, Y., Sui, Z., Fu, F., Wang, J., Chang, L., Guo, W., and Li, B. 2013. Genome Survey Sequencing 508 and Genetic Background Characterization of Gracilariopsis lemaneiformis (Rhodophyta) Based on 509 Next Generation Sequencing. Plos One 8(7): e69909. 510 Zonneveld, B.J., Leitch, I.J., and Bennett, M.D. 2005. First nuclear DNA amounts in more than 300 511 angiosperms. Annals of botany 96(2): 229 244.
Figure 1. Analysis of the D. zingiberensis genome size by flow cytometry.
Figure 2. Distribution of 17-mer frequency for estimating the genome size of D. zingiberensis.
Figure 3. GC content and average sequencing depth of the genome data used for assembly (the x-axis was the GC content percent across every 10-kb non-overlapping sliding window).
Figure 4. Percentage of different motifs in dinucleotide repeats in D. zingiberensis.
Figure 5. Percentage of different motifs in trinucleotide repeats in D. zingiberensis.
Figure 6. Gene Ontology classification. Genes were assigned to three categories: cellular components, molecular functions, and biological processes.
Supporting Information
1. Table S1. Occurrence of SSR motifs in the genome survey of D. zingiberensis. (XLS)
2. Figure S2. Gene assignment to KOG functional categories in D. zingiberensis. (TIF)
3. Table S3. Number of genes mapped onto KEGG pathways. (XLS)
Table 1. Statistics of sequencing data.
Library (bp)  Data (Gb)  Depth (X)  Q20 (%)  Q30 (%)
220           31.48      78.7       95.08    91.1
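The relation between the Data and Depth columns can be checked with a short calculation. The sketch below assumes that depth was obtained by dividing the sequenced data volume by an estimated genome size, which the table itself does not state. Under that assumption, the values in Table 1 and the chloroplast-filtered values in Table 3 both imply a genome size of about 0.4 Gb. The function name `implied_genome_size_gb` is illustrative only.

```python
def implied_genome_size_gb(data_gb, depth_x):
    """Genome size (Gb) implied by depth = data / genome size.

    The depth definition is an assumption; it is not given in the tables.
    """
    if depth_x <= 0:
        raise ValueError("depth must be positive")
    return data_gb / depth_x


# Values copied from Table 1 (all data) and Table 3 (chloroplast data removed).
raw_size = implied_genome_size_gb(31.48, 78.7)
filtered_size = implied_genome_size_gb(27.24, 68.10)

print(f"Table 1: {raw_size:.3f} Gb")
print(f"Table 3: {filtered_size:.3f} Gb")
```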
Table 2. Statistics of the assembled genome sequences.
Contigs
  Number of sequences     334,288
  Total length (bases)    174,634,115
  N50 length (bases)      1,079
  N90 length (bases)      219
  Max length (bases)      19,831
  GC content (%)          39.83
Scaffolds
  Number of sequences     3,548,310
  Total length (bases)    177,618,332
  N50 length (bases)      1,955
  N90 length (bases)      1,110
  Max length (bases)      40,163
  A                       281,503,836
  T                       271,128,258
  G                       174,939,581
  C                       180,143,770
  N                       10,628,111
  Total (ATGC)            907,715,445
  G+C% (ATGC)             39.12
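The base-composition rows of Table 2 can be verified directly from the four base counts. The sketch below recomputes the Total (ATGC) and G+C% (ATGC) rows. It also adds the N count to the total, which gives 918,343,556, the same figure reported in Table 4 as the total size of examined sequences. The variable names are illustrative.

```python
# Base counts copied from Table 2.
counts = {
    "A": 281_503_836,
    "T": 271_128_258,
    "G": 174_939_581,
    "C": 180_143_770,
}
n_count = 10_628_111  # ambiguous bases (N)

# G+C% is computed over unambiguous bases only, as the row label "(ATGC)" indicates.
total_atgc = sum(counts.values())
gc_percent = 100 * (counts["G"] + counts["C"]) / total_atgc
total_with_n = total_atgc + n_count

print(f"Total (ATGC): {total_atgc:,}")
print(f"G+C% (ATGC): {gc_percent:.2f}")
print(f"Total including N: {total_with_n:,}")
```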
Table 3. Statistics of sequencing data after filtering out the chloroplast data.
Library (bp)  Data (Gb)  Depth (X)  Q20 (%)  Q30 (%)
220           27.24      68.10      94.92    90.87
Table 4. Simple sequence repeat types detected in the D. zingiberensis sequences.
Searching Item                                     Number        Ratio
Total number of sequences examined                 3,548,310
Total size of examined sequences (bp)              918,343,556
Total number of identified SSRs                    419,372       100%
Number of SSR-containing sequences                 353,988       84.41%
Number of sequences containing more than 1 SSR     50,592        12.06%
Number of SSRs present in compound formation       33,439        7.97%
Mono-nucleotide                                    228,973       54.60%
Di-nucleotide                                      124,133       29.60%
Tri-nucleotide                                     47,681        11.37%
Tetra-nucleotide                                   14,815        3.53%
Penta-nucleotide                                   2,726         0.65%
Hexa-nucleotide                                    1,044         0.25%
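The Ratio column for the six motif classes is consistent with each count being divided by the total number of identified SSRs (419,372). The sketch below recomputes those ratios from the counts in Table 4 as a worked check. The dictionary and variable names are illustrative.

```python
# SSR counts per motif length, copied from Table 4.
ssr_counts = {
    "Mono": 228_973,
    "Di": 124_133,
    "Tri": 47_681,
    "Tetra": 14_815,
    "Penta": 2_726,
    "Hexa": 1_044,
}

# The six classes should add up to the total number of identified SSRs.
total_ssrs = sum(ssr_counts.values())

# Each ratio is the share of all identified SSRs, in percent.
ratios = {motif: 100 * n / total_ssrs for motif, n in ssr_counts.items()}

print(f"Total SSRs: {total_ssrs:,}")
for motif, pct in ratios.items():
    print(f"{motif}-nucleotide: {pct:.2f}%")
```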
Table 5. Statistics on gene information.
Software  Gene number  Gene length (bp)  Average gene length (bp)  Exon length (bp)  Average exon length (bp)  Intron length (bp)  Average intron length (bp)
EVM       27,057       24,659,071        911.37                    14,722,744        544.14                    6,388,807           236.12
Table 6. Statistics of gene functional annotation.
Annotation database  Annotated number  Percentage (%)
GO                   12,736            47.07
KOG                  13,432            49.64
KEGG                 8,395             31.03
NR                   22,306            82.44
TrEMBL               22,024            81.40
All Annotated        2,915             86.78
Table 7. Genome sizes of species in the genus Dioscorea.
Species                  Genome size (Mb)  Original Reference
Dioscorea dumetorum      831.3             (Obidiegwu et al. 2009)
Dioscorea tokoro         430.3             (Veselý et al. 2012)
Dioscorea togoensis      469.4             (Hamon et al. 2011)
Dioscorea alata          567.2             (Arumuganathan and Earle 1991)
Dioscorea abyssinica     616.1             (Hamon et al. 2011)
Dioscorea mangenotiana   616.1             (Hamon et al. 2011)
Dioscorea praehensilis   616.1             (Hamon et al. 2011)
Dioscorea rotundata      694.3             (Obidiegwu et al. 2009)
Dioscorea cayenensis     753.1             (Obidiegwu et al. 2009)
Dioscorea sylvatica      831.3             (Bharathan et al. 1994)
Dioscorea esculenta      1026.9            (Obidiegwu et al. 2009)
Dioscorea bulbifera      1173.6            (Obidiegwu et al. 2009)
Dioscorea villosa        2347.2            (Bharathan et al. 1994)
Dioscorea elephantipes   6601.5            (Zonneveld et al. 2005)
Figure 1. Analysis of the D. zingiberensis genome size by flow cytometry.
Figure 2. Distribution of 17-mer frequency for estimating the genome size of D. zingiberensis.
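A common way to turn a k-mer frequency distribution such as this one into a genome size estimate is to divide the total number of k-mer occurrences by the depth at the main peak. The following is a minimal sketch of that general approach, not necessarily the exact procedure used for this figure. The histogram values are invented for illustration, and `min_depth`, a cutoff for ignoring low-depth k-mers that mostly arise from sequencing errors, is an assumed parameter.

```python
def estimate_genome_size(hist, min_depth=3):
    """Estimate genome size as (total k-mer occurrences) / (peak k-mer depth).

    hist maps a k-mer depth to the number of distinct k-mers seen at that depth.
    Depths below min_depth are treated as error k-mers and ignored.
    """
    usable = {depth: n for depth, n in hist.items() if depth >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # Depth at which the largest number of distinct k-mers occurs.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences contributed by the usable depths.
    total_kmers = sum(depth * n for depth, n in usable.items())
    return total_kmers / peak_depth


# Toy histogram (hypothetical numbers, not data from this study).
toy_hist = {1: 5000, 2: 800, 18: 200, 19: 600, 20: 1000, 21: 600, 22: 200}
size = estimate_genome_size(toy_hist)
print(f"Estimated genome size: {size:.0f} bp")
```

Real distributions from heterozygous or repeat-rich genomes often show secondary peaks, so the choice of the main peak usually needs visual confirmation against the plotted curve.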
Figure 3. GC content and average sequencing depth of the genome data used for assembly (the x-axis was the GC content percent across every 10-kb non-overlapping sliding window).
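The x-axis values described in this caption can be illustrated with a short sketch that computes GC percent over non-overlapping windows of a sequence. The 10-kb window size comes from the caption. Dropping the final partial window and skipping windows that contain no unambiguous bases are assumptions made here for simplicity. The function name `gc_percent_windows` is illustrative, and pairing each window with its average sequencing depth is not shown.

```python
def gc_percent_windows(seq, window=10_000):
    """Return GC% for each full, non-overlapping window of seq."""
    seq = seq.upper()
    values = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        # Count only unambiguous bases so that N does not dilute the percentage.
        acgt = sum(chunk.count(base) for base in "ACGT")
        if acgt == 0:
            continue  # window made up entirely of N or other ambiguity codes
        gc = chunk.count("G") + chunk.count("C")
        values.append(100 * gc / acgt)
    return values


# Toy sequence with a small window so the result is easy to follow.
toy_seq = "GGCC" * 5 + "ATAT" * 5 + "GCAT" * 5 + "AT"
toy_values = gc_percent_windows(toy_seq, window=20)
print(toy_values)
```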
Figure 4. Percentage of different motifs in dinucleotide repeats in D. zingiberensis.
Figure 5. Percentage of different motifs in trinucleotide repeats in D. zingiberensis.
Figure 6. Gene Ontology classification. Genes were assigned to three categories: cellular components, molecular functions, and biological processes.