Full-Length Genome of the Ogataea Polymorpha Strain HU-11 Reveals
Total Page:16
File Type:pdf, Size:1020Kb
bioRxiv preprint doi: https://doi.org/10.1101/2021.04.17.440260; this version posted April 19, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 1 Full-length Genome of the Ogataea polymorpha strain HU-11 reveals 2 large duplicated segments in subtelomeic regions 3 4 Jia Chang1$, Jinlong Bei23$, Hemu Wang4, Jun Yang4, Xin Li1 5 Tung On Yau5, Wenjun Bu1, Jishou Ruan6, Guangyou Duan7*, Shan Gao1* 6 7 1. College of Life Sciences, Nankai University, Tianjin, Tianjin 300071, P.R.China. 8 2. Guangdong Provincial Key Laboratory for Crop Germplasm Resources Preservation and Utilization, 9 Agro-Biological Gene Research Center, Guangdong Academy of Agricultural Sciences, Guangzhou, 10 Guangdong 510640, P. R. China. 11 3. Guangdong Laboratory for Lingnan Modern Agriculture, Guangzhou, Guangdong 510642, P. R. China. 12 4. Tianjin Hemu Health Biotechnological Co., Ltd, Tianjin, Tianjin 300384, P.R.China 13 5. John Van Geest Cancer Research Centre, School of Science and Technology, Nottingham Trent 14 University, Nottingham, NG11 8NS, United Kingdom; 15 6. School of Mathematical Sciences, Nankai University, Tianjin, Tianjin 300071, P.R.China 16 7. School of Life Sciences, Qilu Normal University, Jinan, Shandong 250200, P.R.China. 17 18 19 20 21 $ These authors contributed equally to this paper. 22 * The corresponding authors. 23 SG:[email protected] 24 GD:[email protected] 25 - 1 - bioRxiv preprint doi: https://doi.org/10.1101/2021.04.17.440260; this version posted April 19, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 26 Abstract 27 28 Background: Currently, methylotrophic yeasts (e.g., Pichia pastoris, Hansenula polymorpha, and Candida 29 boindii) are subjects of intense genomics studies in basic research and industrial applications. In the genus 30 Ogataea, most research is focused on three basic O. polymorpha strains—CBS4732, NCYC495, and DL-1. 31 However, these three strains are of independent origin and unclear relationship. As a high-yield engineered 32 O. polymorpha strain, HU-11 can be regarded as identical to CBS4732, because the only difference between 33 them is a 5-bp insertion. 34 Results: In the present study, we have assembled the full-length genome of O. polymorpha HU-11 using 35 high-depth PacBio and Illumina data. Long terminal repeat (LTR) retrotransposons, rDNA, 5' and 3' 36 telomeric, subtelomeric, low complexity and other repeat regions were curated to improve the genome 37 quality. We took advantage of the full-length HU-11 genome sequence for the genome annotation and 38 comparison. Particularly, we determined the exact location of the rDNA genes and LTR retrotransposons in 39 seven chromosomes and detected large duplicated segments in the subtelomeic regions. Three novel 40 findings are: (1) the O. polymorpha NCYC495 is so phylogenetically similar to HU11 that a nearly 100% 41 of their genomes is covered by their syntenic regions, while NCYC495 is significantly distinct from DL-1; 42 (2) large segment duplication in subtelomeic regions is the main reason for genome expansion in yeasts; and 43 (3) the duplicated segments in subtelomeric regions may be integrated at telomeric tandem repeats (TRs) 44 through a molecular mechanism, which can be used to develop a simple and highly efficient genome editing 45 system to integrate or cleave large segments at telomeric TRs. 46 Conclusions: Our findings provide new opportunities for in-depth understanding of genome evolution in 47 methylotrophic yeasts and lay the foundations for the industrial applications of O. polymorpha CBS4732 48 and HU11. The full-length genome of the O. polymorpha strain HU-11 should be included into the NCBI 49 RefSeq database for future studies of O. polymorpha CBS4732 and its derivatives LR9, RB11 and HU-11. 50 51 52 53 54 55 56 57 58 59 60 61 62 Keywords: Methylotrophic yeast; Ogataea; CBS4732; rDNA quadruple; retrotransposon 63 - 2 - bioRxiv preprint doi: https://doi.org/10.1101/2021.04.17.440260; this version posted April 19, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 64 Introduction 65 Currently, methylotrophic yeasts (e.g., Pichia pastoris, Hansenula polymorpha, and Candida boindii) 66 are subjects of intense genomics studies in basic research and industrial applications. However, genomic 67 research on Ogataea (Hansenula) polymorpha trails behind that on P. pastoris [1], given that they both are 68 popular and widely used species of methylotrophic yeasts. In the genus Ogataea, most research is focused 69 on three basic O. polymorpha strains—CBS4732 (synonymous to NRRL-Y-5445 or ATCC34438), 70 NCYC495 (synonymous to NRRL-Y-1798, ATCC14754, or CBS1976), and DL-1 (synonymous to NRRL- 71 Y-7560 or ATCC26012). These three strains are of independent origin: CBS4732 was originally isolated 72 from soil irrigated with waste water from a distillery in Pernambuco, Brazil in 1959 [2]; NCYC495 is 73 identical to a strain first isolated from spoiled concentrated orange juice in Florida and initially designated as 74 Hansenula angusta by Wickerham in 1951 [3]; and DL-1 was isolated from soil by Levine and Cooney in 75 1973 [4]. Although it is known that CBS4732 and NCYC495 can be mated while DL-1 cannot be mated 76 with the either one of them, the relationship of all three strains is remains unclear [5]. CBS4732 and its 77 derivatives—LR9, RB11, and HU-11—have been developed as genetically engineered strains to produce 78 many heterologous proteins, including enzymes (e.g., feed additive phytase), anticoagulants (e.g., hirudin 79 and saratin), and an efficient vaccine against hepatitis B infection [5]. As a nutritionally deficient mutant 80 derived from CBS4732, the O. polymorpha strain HU-11 [6] is being used for high-yield production of 81 several important protein or peptides, particularly including recombinant hepatitis B surface antigen (HBsAg) 82 vaccine [7] and hirudin [8]. CBS4732 and HU-11 can be regarded as identical strains, as the only difference 83 between them is a 5-bp insertion caused by frame-shift mutation of its URA3 gene, which encodes orotidine 84 5'-phosphate decarboxylase. 85 To facilitate genomic research of yeasts, genome sequences have been increasingly submitted to the 86 Genome- NCBI datasets. Among the genomes of 34 species in the Ogataea or Candida genus 87 (Supplementary file 1), those of NCYC495 and DL-1 have been assembled at the chromosome level. 88 However, the other genomes have been assembled at the contig or scaffold level. Furthermore, the genome 89 sequence of CBS4732 was not available in the Genome- NCBI datasets until this manuscript was drafted. 90 Among the genomes of 33 Komagataella (Pichia) spp., the genome of the P. pastoris strain GS115 is the 91 only genome assembled at the chromosome level. The main problem of these Ogataea, Candida, or Pichia 92 genomes is their incomplete sequences and poor annotations. For example, the rDNA sequence (GenBank: 93 FN392325) of P. pastoris GS115 cannot be well aligned to its genome (Genbank assembly: 94 GCA_001708105). Most genome sequences do not contain complete subtelomeric regions and, as a result, 95 subtelomeres are often overlooked in comparative genomics [9]. For example, the genome of DL-1 has been 96 analyzed for better understanding the phylogenetics and molecular basis of O. polymorpha [1]; however, it 97 does not contain complete subtelomeric regions due to assembly using short sequences. Another problem of 98 current yeast genome data is that the complete sequences of mitochondrial genomes is not simultaneously - 3 - bioRxiv preprint doi: https://doi.org/10.1101/2021.04.17.440260; this version posted April 19, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. 99 released with those of nuclear genomes. The only complete mitochondrial genome in the NCBI GenBank 100 database is the O. polymorpha DL-1 mitochondrial genome (RefSeq: NC_014805). More high-quality 101 complete genome sequences of Ogataea spp. need to be sequenced to bridge the gap in Ogataea basic 102 research and industrial applications. 103 In the present study, we have assembled the full-length genome of O. polymorpha HU-11 using high- 104 depth PacBio and Illumina data and conducted the annotation and analysis to achieve the following research 105 goals: (1) to determine the relationship between CBS4732/HU-11, NCYC495, and DL-1; (2) to provide a 106 high-quality and well-curated reference genome for future studies of O. polymorpha CBS4732 and its 107 derivatives- LR9, RB11, and HU-11; and (3) to discover important genomic features (e.g., high yield) of 108 Ogataea spp. for basic research (e.g., synthetic biology) and industrial applications. 109 110 Results and Discussion 111 Genome sequencing, assembly and annotation 112 One 500 bp and one 10 Kbp DNA library each were prepared using fresh cells of O. polymorpha HU- 113 11 and sequenced on the Illumina HiSeq X Ten and PacBio Sequel platforms, respectively, to obtain the 114 high-quality genome sequences. Firstly, 18,319,084,791 bp cleaned PacBio DNA-seq data were used to 115 assembled the complete genome, except for the rDNA region (analyzed in further detail in subsequent 116 sections), with an extremely high depth of ~1800X. However, the assembled genome using high-depth 117 PacBio data still contained two types of error in the low complexity (Figure 1A) and the short tandem repeat 118 (STR) regions (Figure 1B). Then, 6,628,480,424 bp cleaned Illumina DNA-seq data were used to polish the 119 complete genome of HU-11 to remove the two types of error.