Genome Survey Sequencing of Dioscorea zingiberensis
Journal: Genome
Manuscript ID: gen-2018-0011.R2
Manuscript Type: Article
Date Submitted by the Author: 22-May-2018
Complete List of Authors:
Zhou, Wen; Shaanxi Normal University, Life Science School
Li, Bin; Shaanxi Normal University, Life Science School
Li, Lin; Shaanxi Normal University, Life Science School
Ma, Wen; Shaanxi Normal University, Life Science School
Liu, Yuanchu; Shaanxi Normal University, Life Science School
Feng, Shuchao; Shaanxi Normal University, Life Science School
Wang, Zhezhi; Shaanxi Normal University, Life Science School
Keyword: Dioscorea zingiberensis, Genome survey sequencing, Genome analysis
Is the invited manuscript for consideration in a Special Issue?: N/A
https://mc06.manuscriptcentral.com/genome-pubs

Wen Zhou+; Bin Li+; Lin Li; Wen Ma; Yuanchu Liu; Shuchao Feng; Zhezhi Wang*

1 Key Laboratory of the Ministry of Education for Medicinal Resources and Natural Pharmaceutical Chemistry, Shaanxi Normal University, Xi'an, Shaanxi 710119, P. R. China
2 National Engineering Laboratory for Resource Development of Endangered Chinese Crude Drugs in Northwest China, Shaanxi Normal University, Xi'an, Shaanxi 710119, P. R. China

+ These authors contributed equally to this work.

*Correspondence: Prof. Zhe-Zhi WANG; [email protected]; Tel.: +86-29-85310260

Abstract
Dioscorea zingiberensis (Dioscoreaceae) is the main plant source of diosgenin (a steroidal sapogenin), the precursor for the production of steroid hormones in the pharmaceutical industry. Despite its large economic value, genomic information on this Dioscorea species is currently unavailable. Here, we present an initial survey of the D.
zingiberensis genome performed by next-generation sequencing technology, together with a genome size estimate obtained by flow cytometry. The whole genome survey of D. zingiberensis generated 31.48 Gb of sequence data with approximately 78.70× coverage. The estimated genome size is 800 Mb, with a high level of heterozygosity based on k-mer analysis. These reads were assembled into 334,288 contigs with an N50 length of 1,079 bp, which were further assembled into 92,163 scaffolds with a total length of 173.46 Mb. A total of 4,935 genes, 81 tRNAs, 69 rRNAs, and 661 miRNAs were predicted by the genome analysis, and 263,484 repeated sequences were obtained, together with 419,372 simple sequence repeats (SSRs). Among these SSRs, the mononucleotide repeat type was the most abundant (54.60% of the total SSRs), followed by the dinucleotide (29.60%), trinucleotide (11.37%), tetranucleotide (3.53%), pentanucleotide (0.65%), and hexanucleotide (0.25%) repeat types. The 1C-value of D. zingiberensis was calibrated against Salvia miltiorrhiza and calculated as 0.87 pg (851 Mb) by flow cytometry, which is close to the result of the genome survey. This is the first report of genome-wide characterization within this taxon.
Key Words: Dioscorea zingiberensis, Genome survey sequencing, Genome analysis

Introduction
D. zingiberensis is an important and widely used medicinal herb in Traditional Chinese Medicine (TCM). It has been used to treat various conditions, including cough, anthrax, rheumatoid arthritis, sprains, and cardiac diseases (Li et al. 2010a; Qin et al. 2009). Diosgenin, a steroidal sapogenin that is abundant in the rhizomes of D. zingiberensis, is an important steroidal precursor used in the pharmaceutical industry.
In the medical industry, diosgenin is widely used as the starting material for the synthesis of many steroidal drugs (e.g., antioxidants, anti-inflammatories, androgens, oestrogens, and contraceptives) due to the similarity in their skeletons (Bertrand et al. 2009; Wang et al. 2007). More importantly, steroidal sapogenins are attractive to many synthetic and medicinal chemists aiming to harness their anticancer activity (Minato et al. 2013). As global market demand for steroid hormones, such as sex hormones, cortical hormones, and protein anabolic hormones, increases at 8% annually, a matching supply of the precursor is required (Bai et al. 2015).
At present, the extraction of diosgenin from D. zingiberensis usually generates large volumes of high-acid and high-strength wastewater, which poses a serious threat to the environment. In view of this, microbial bioengineering is an effective method for producing diosgenin. However, genetic studies on D. zingiberensis remain underdeveloped compared with many other herbs, such as Salvia miltiorrhiza (Wenping et al. 2011), Dendrobium officinale (Liang et al. 2015), and Ganoderma lucidum (Chen et al. 2012), possibly because of the limited genetic and genomic resources available for D. zingiberensis.
In recent years, great advances in genome survey sequencing technology and bioinformatics have opened a new avenue for characterizing the genetic background of organisms, e.g., Myrica rubra (Jiao et al. 2012), Gracilariopsis lemaneiformis (Zhou et al. 2013), Fagopyrum tataricum (Hou et al. 2016), and others. Compared with conventional methods for gene cloning and sequencing, next-generation sequencing (NGS) technology affords a quick, easy, and full-scale method of investigation.
To provide a genomic resource for further research on this species (e.g., structural and functional genomics studies, molecular cloning, comparative and evolutionary studies), we conducted a genome survey of D. zingiberensis using NGS technology. This study could pave the way for accelerating gene discovery and for better utilization of the existing genomic information in the future.

Materials and methods
Plant materials
D. zingiberensis was collected from Xunyang County, Shaanxi Province, China. Voucher specimens were prepared and identified by Prof. Tian Xianhua (College of Life Sciences, Shaanxi Normal University, Xi’an, P. R. China) and then deposited at the Key Laboratory of Ministry of Education for Medicinal Resources and Natural Pharmaceutical Chemistry, Shaanxi Normal University. Young leaves were collected, frozen in liquid nitrogen, and stored at –80°C prior to genomic DNA extraction using the Plant Genomic DNA Kit (Tiangen Biotech, Beijing, China) following the manufacturer’s instructions. The extracts were electrophoresed on 1% agarose to confirm the DNA quality and quantity. The concentrations of nucleic acids and proteins were measured spectrophotometrically at 260 nm on a BioPhotometer (Eppendorf, Germany).

Genome size estimation by flow cytometry
Salvia miltiorrhiza (1C = 0.66 pg DNA; Zhang et al. 2015) served as an internal reference standard. One to two young leaves per plant, equivalent to 300-500 mg, were excised and placed into a 100 mm Petri dish. To this, 1.5 mL of LB01 buffer (Doležel et al. 1989) was added, and the two types of tissue were chopped simultaneously with a razor for 30 s (~60 chops per sample) to release the nuclei. The resulting homogenate was filtered through a 48 µm nylon filter into a 1.5 mL tube.
Then, the nuclear suspension was stained with 10 µL of propidium iodide (PI, 10 mg/mL), and 10 µL of RNase A (10 mg/mL) was added immediately to prevent the staining of double-stranded RNA. The samples were incubated on ice for 10 minutes. Then, the aqueous suspension of intact nuclei from the samples and the internal reference DNA standard were analysed on a NovoCyte flow cytometer (ACEA Biosciences, Inc.) with NovoExpress software (Version 1.2.4.1602). A green argon laser at a wavelength of 488 nm was used as the light source, and at least 10,000 nuclei were measured per sample.

Genome sequencing and sequence assembly
Two paired-end libraries with an insert size of 220 base pairs (bp) were constructed from randomly fragmented genomic DNA following the manufacturer’s instructions (Illumina, Beijing, China). Sequence data were generated by Beijing Biomarker Technologies Co., Ltd. (Beijing, China) using an Illumina HiSeq 2500 sequencing platform. Short and low-quality sequences in the raw genome survey data were filtered out to obtain high-quality reads, which were subsequently assembled with SOAPdenovo software (Li et al. 2010b). All sequencing reads were deposited in the Sequence Read Archive (SRA) database (http://www.ncbi.nlm.nih.gov/sra/) and are retrievable under the accession number SRX3235157.

Genome size estimation by k-mer analysis
In shotgun genome sequencing, short reads are assumed to be randomly generated, so any k-mers in the reads also occur randomly. Their depth of coverage follows the Poisson distribution (Li and Waterman 2003), and the mean k-mer depth should be equal to the peak value of the k-mer depth distribution. Two paired-end libraries with insert sizes of approximately 220 bp and 500 bp were sequenced on one lane of the Illumina HiSeq 2500 system in paired-end 150 bp mode.
The high-quality Illumina sequences generated from these two genomic libraries were used for k-mer counting with SOAPec (v2.01) in the SOAPdenovo software package. Then, based on the k-mer analysis, the peak depth and the number of 17-mers were obtained.
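
The peak depth and the 17-mer count are typically combined into a genome size estimate by dividing the total number of k-mers by the peak depth. That relation is not stated in the passage above, so the sketch below treats it as an assumption. It is a minimal Python illustration and does not reproduce the SOAPec workflow: the function names, the toy reads, and the choice to discard depth-1 k-mers as likely sequencing errors are all hypothetical. For simplicity it also ignores reverse-complement (canonical) k-mers. The last helper converts a flow cytometry 1C-value from picograms to megabases using the widely used factor 1 pg ≈ 978 Mb, which is likewise an assumption and not taken from the text.

```python
from collections import Counter

# Assumption: 1 pg of DNA corresponds to about 978 Mb (not stated in the text).
PG_TO_MB = 978.0


def kmer_depth_histogram(reads, k=17):
    """Count each k-mer, then tally how many distinct k-mers occur at each depth."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    # Maps depth -> number of distinct k-mers observed at that depth.
    return Counter(counts.values())


def estimate_genome_size(histogram, min_depth=2):
    """Estimate genome size as total k-mer number divided by peak k-mer depth.

    K-mers below min_depth are dropped as probable sequencing errors
    (an illustrative threshold, not a value from the manuscript).
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    peak_depth = max(usable, key=usable.get)
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth, peak_depth


def c_value_to_mb(picograms):
    """Convert a 1C-value in pg to an approximate genome size in Mb."""
    return picograms * PG_TO_MB


if __name__ == "__main__":
    # Toy histogram: an error peak at depth 1 and a main peak at depth 20.
    toy_histogram = {1: 500, 19: 10, 20: 100, 21: 10}
    size, peak = estimate_genome_size(toy_histogram)
    print(f"peak depth = {peak}, estimated size = {size:.0f} bp")
    print(f"0.87 pg is about {c_value_to_mb(0.87):.0f} Mb")
```

With the 1C-value of 0.87 pg reported in the abstract, the conversion gives about 851 Mb, which matches the flow cytometry figure quoted there. On real data the histogram would come from a dedicated k-mer counter, and a heterozygous genome may show a secondary peak that a single-peak estimate like this one does not model.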