Contig-Level Assembly of the Duboisia myoporoides Genome
Joseph Wang, Robert Henry
The Queensland Alliance for Agriculture and Food Innovation (QAAFI), The University of Queensland, Brisbane

Figure 1. Duboisia plant flower (left) and branch (right).

Why Duboisia

Duboisia is a genus of plants native to Australia (Figure 1). Like many other Solanaceae plants, Duboisia is rich in a variety of alkaloids such as nicotine, atropine and hyoscyamine. We studied alkaloid biosynthesis in the Duboisia genus by whole genome sequencing (WGS).

Data specification

Duboisia DNA was extracted and sequenced on the Illumina HiSeq and PacBio Sequel II platforms. Both achieved read depths of around 50-60X (Table 1).

Sequencing data summary

Platform   Number of reads   Read length           Total bases
Illumina   325,059,149       2 x 150 bp            97.52 Gbp
PacBio     5,585,012         50 bp to 247,058 bp   99.61 Gbp

Table 1. Data volume for short reads and continuous long reads (CLR).

Genome size estimation

A k-mer profile was built from the short-read data with 27-mers (Figure 2), suggesting that Duboisia myoporoides is a diploid, highly homozygous, sparsely duplicated plant with a total genome size of around 1.5 Gbp (the average plant genome size is around 800 Mbp).

Figure 2. K-mer profile of Duboisia myoporoides produced by Jellyfish from Illumina data, visualized with GenomeScope. First peak at 1C (13); second at 2C (25).

De novo genome assembly

De novo assembly was performed with three assemblers: Canu, Mecat2 and Falcon. All three successfully generated contigs, although of different sizes. Each assembly was given a code name for simplicity.

• Canu assembly code-named 005
• Mecat2 assembly code-named 006
• Falcon assembly code-named 013

The three raw assemblies were evaluated according to their contiguity and contig statistics (Table 2 and Table 3). Because computational resources were limited, the three de novo assemblies were configured to balance computation time against final data volume, hence the strikingly different performance.

Overview of assembly results

ID    Assembler   Size (Gbp)   N50 (kbp)   Contig number
005   Canu        2.517        1183        9457
006   Mecat2      2.110        651         8701
013   Falcon      1.604        443         7496

Table 2. Key statistics for the three assemblies.

Summary of contig statistics

Quantile (%)   005 N (bp)   005 L   006 N (bp)   006 L   013 N (bp)   013 L
10             12,017,012   15      3,146,453    48      1,733,512    74
20             5,526,062    44      1,889,731    136     1,220,728    185
25             4,101,815    71      1,619,154    196     1,059,287    255
30             3,107,326    105     1,348,196    268     886,415      338
40             1,848,336    212     938,761      456     641,760      554
50             1,183,783    385     651,227      731     443,122      853
60             683,265      663     456,244      1119    315,683      1282
70             403,431      1145    318,354      1676    215,801      1897
75             295,953      1510    260,728      2041    177,217      2307
80             210,464      2019    207,148      2494    145,585      2808
90             95,916       3791    113,273      3855    86,209       4217

Table 3. Contig statistics for the three assemblies. For each quantile, N is the minimum contig length and L the number of contigs required to sum up to that fraction of the assembly.

The three raw assemblies were then polished differently. For the Canu and Mecat2 assemblies (005 and 006), accurate Illumina short reads were mapped against the contigs, which were then polished with Pilon to resolve SNPs, indels and gaps. This step was iterated several times to determine the optimal number of polishing rounds. Contig statistics remained mostly unchanged by polishing, yet BUSCO results suggested heavy duplication with excessive polishing. The Falcon assembly (013), on the other hand, was polished with its native companion, Arrow, without iteration.

Genome completeness

De novo assembly is not a hypothesis-based investigation, but its reliability can be assessed by determining how many genes from related species are present in the assembled contigs. Genome completeness can be inferred from the number of conserved orthologs (BUSCOs) that are observed (Figure 3).

Figure 3. BUSCO plot of 005, 006 and 013. Raw assemblies are suffixed with 0 and polished assemblies with 1. The solanales_odb10 database was used.

BUSCO results demonstrated that all assemblies were reasonably complete but contained a large number of duplicated sequences, especially after polishing.

Pair-wise alignment

MUMmer was applied to detect assembly artifacts by performing an all-versus-all alignment (Figures 4, 5 and 6). Self-alignment maps are not included because they uniformly show a single straight diagonal.

Figure 4. All-versus-all alignment map between 005 and 013 visualized by Assemblytics.

Figure 5. All-versus-all alignment map between 006 and 013 visualized by Assemblytics.

Figure 6. All-versus-all alignment map between 005 and 006 visualized by Assemblytics.

Apart from self-alignments, where each contig consistently matches itself and virtually nowhere else, no major inversions, duplications, translocations or gaps were found, as expected, despite the missing BUSCOs.

Conclusion

To sum up, we successfully assembled the Duboisia myoporoides genome de novo. Canu, Mecat2 and Falcon each assembled contigs that sufficiently represent the target genome. Despite varying size and quality, all genes of interest were retrieved from the assembled contigs with BLAST. Polishing with Illumina data or consensus calling from corrected CLR data substantially rectified errors resulting from de novo assembly. Variants identified from the assemblies confirmed a low level of heterozygosity. This experiment was a preliminary investigation into the Duboisia genus. CLR sequencing of other species, RNA-seq and Hi-C are the next steps towards fully assembling and annotating the Duboisia genome, followed by CRISPR gene editing or genetic engineering to enhance desirable traits.

Acknowledgements

• Special thanks to peers from QAAFI who performed the DNA extraction.
• HPC resources were kindly provided by the UQ Research Computing Centre (RCC).
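As a supplementary illustration of the N and L columns in Table 3, the following is a minimal sketch of how such values could be computed from a list of contig lengths. It is not the script used for this poster; the function name `nx_lx` and the toy contig lengths are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Tuple


def nx_lx(contig_lengths: Iterable[int], x: float) -> Tuple[int, int]:
    """Return (N, L) for quantile x, with 0 < x <= 100.

    N is the length of the shortest contig in the smallest set of longest
    contigs whose combined length reaches x% of the assembly; L is the
    number of contigs in that set.
    """
    if not 0 < x <= 100:
        raise ValueError("x must be in the interval (0, 100]")
    lengths = sorted(contig_lengths, reverse=True)  # longest contigs first
    if not lengths:
        raise ValueError("no contig lengths given")
    threshold = sum(lengths) * x / 100
    running_total = 0
    for count, length in enumerate(lengths, start=1):
        running_total += length
        if running_total >= threshold:
            return length, count
    # Only reachable through floating-point rounding at x = 100.
    return lengths[-1], len(lengths)


# Hypothetical toy assembly (lengths in bp), not data from this study.
toy_contigs = [100, 80, 60, 40, 20]
n50, l50 = nx_lx(toy_contigs, 50)
n90, l90 = nx_lx(toy_contigs, 90)
print(f"N50={n50} L50={l50} N90={n90} L90={l90}")
```

In practice the contig lengths would be read from the assembly FASTA file, and each quantile in Table 3 would correspond to one call with a different value of `x`.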
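The completeness and duplication levels discussed under Genome completeness are commonly read from the one-line notation in a BUSCO short summary. The sketch below assumes that notation has the form `C:..%[S:..%,D:..%],F:..%,M:..%,n:..`; the helper name `parse_busco_summary` is hypothetical, and the example line contains invented numbers, not results from this study.

```python
import re
from typing import Dict

# Assumed layout of the BUSCO one-line summary:
# C = complete, S = single-copy, D = duplicated, F = fragmented,
# M = missing, n = number of BUSCO groups searched.
_BUSCO_LINE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line: str) -> Dict[str, float]:
    """Extract the percentages and group count from a BUSCO summary line."""
    match = _BUSCO_LINE.search(line)
    if match is None:
        raise ValueError("line does not look like a BUSCO one-line summary")
    result: Dict[str, float] = {
        key: float(value) for key, value in match.groupdict().items()
    }
    result["n"] = int(match.group("n"))  # group count is an integer
    return result


# Invented example line for illustration only.
example = "C:90.0%[S:60.0%,D:30.0%],F:4.0%,M:6.0%,n:1000"
scores = parse_busco_summary(example)
print(f"complete={scores['C']}% duplicated={scores['D']}% missing={scores['M']}%")
```

Comparing the `D` value of a raw assembly with that of its polished counterpart is one simple way to track the duplication increase described above.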