Comparative Genomics in the Era of Long-Reads an Application on Industrial Yeasts Salazar, A.N
Total Page:16
File Type:pdf, Size:1020Kb
Delft University of Technology Comparative genomics in the era of long-reads An application on industrial yeasts Salazar, A.N. DOI 10.4233/uuid:90594179-e599-4371-ac63-3fa800c53cc9 Publication date 2021 Document Version Final published version Citation (APA) Salazar, A. N. (2021). Comparative genomics in the era of long-reads: An application on industrial yeasts. https://doi.org/10.4233/uuid:90594179-e599-4371-ac63-3fa800c53cc9 Important note To cite this publication, please use the final published version (if applicable). Please check the document version above. Copyright Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons. Takedown policy Please contact us and provide details if you believe this document breaches copyrights. We will remove access to the work immediately and investigate your claim. This work is downloaded from Delft University of Technology. For technical reasons the number of authors shown on this cover page is limited to a maximum of 10. Comparative genomics in the era of long-reads An application on industrial yeasts Alex N. Salazar Comparative genomics in the era of long-reads An application on industrial yeasts Dissertation for the purpose of obtaining the degree of doctor at Delft University of Technology by the authority of the Rector Magnificus, Prof.dr.ir. T.H.J.J. van der Hagen chair of the Board of Doctorates to be defended publicly on Friday, 19 February 2021 at 10:00 o’clock by Alex N. Salazar Bachelor of Science in Bioengineering University of California, Santa Cruz, USA born in San Jose, California, USA This dissertation has been approved by the: promotor: Prof. dr. ir. M.J.T. Reinders copromotor: Dr. T. E. P. M. F. Abeel Composition of the doctoral committee: Rector Magnificus, chairperson Prof. dr. ir. M.J.T. Reinders, Delft University of Technology, promotor Dr. T. E. P. M. F. Abeel, Delft University of Technology, copromotor Independent members: Prof. dr. P. A. S. Daran, Delft University of Technology Prof. dr. A. Schoehnhut, Centrum Wiskunde & Informatica Prof. dr. B. Renard, University of Postdam Dr. J. A. Roubos, DSM Prof. dr. ir. J. Fostier, Ghent University, other member Prof. dr. R. C. H. J. van Ham, Delft University of Technology, reserve member The research presented in this dissertation was funded by the BE-Basic R&D Program (http://www.be-basic.org/), which was granted a TKI-subsidy subsidy from the Dutch Min- istry of Economic Affairs, Agriculture and Innovation (EL&I). Layout by Delft University of Technology, modified by Moritz Beller Copyright © 2021 by A.N. Salazar ISBN 978-90-9034-313-6 An electronic version of this dissertation is available at http://repository.tudelft.nl/. v ”The time will shortly come when the release of the complete sequence of a novel organism will no longer be a matter for excitement. The time will even come when students inbiology will have difficulty in imagining that, in the obscure past, there were organisms not yetfully sequenced! How could geneticists do their work then? How could they understand what they were doing to the parts when they were missing the whole?” —Bernard Dujon The yeast genome project: what did we learn? (1996) vii Contents Summary xi Samenvatting xiii Preface 1 0.1 A brief history of beer . 2 0.1.1 A taste for alcohol . 2 0.2 The evolution of beer. 5 0.3 Yeast: man’s best microbial friend . 10 1 Introduction 15 1.1 In the era of long-read genomic data . 17 1.1.1 On the fundamentals of sequence alignment . 18 1.1.2 De novo genome assembly: the early days . 23 1.1.3 Long-read sequence mapping and alignment . 27 1.1.4 Long-read de novo genome assembly . 32 1.1.5 Genomic fingerprints. 36 1.1.6 Microbial pan-genomes . 39 1.2 An overview of this thesis . 41 1.2.1 The case of the missing MAL gene . 41 1.2.2 Tracing genome mosaicism in microbial genomes . 41 1.2.3 Where do lager-yeast originate? . 41 1.2.4 A streaming algorithm to infer species-composition in Saccharomyces genomes . 42 1.2.5 How can one compare 푛 diverse microbial genome assemblies? . 42 1.2.6 Can we better educate microbiologists in bioinformatics? . 42 2 Nanopore sequencing enables near-complete de novo assembly of Sac- charomyces cerevisiae reference strain CEN.PK113-7D 45 2.1 Introduction . 45 2.2 Materials and Methods . 47 2.2.1 Yeast strains . 47 2.2.2 Yeast cultivation and genomic DNA extraction . 47 2.2.3 Short-read Illumina sequencing . 48 2.2.4 MinION sequencing . 48 2.2.5 De novo genome assembly . 48 2.2.6 Analysis of added information in the CEN.PK113-7D nanopore as- sembly . 49 2.2.7 Comparison of the CEN.PK113-7D assembly to the S288C genome . 50 2.2.8 Chromosome translocation analysis . 50 viii Contents 2.3 Results . 51 2.3.1 Sequencing on a single nanopore flow cell enables near-complete genome assembly . 51 2.3.2 Comparison of the nanopore and short-read assemblies of CEN.PK113- 7D . 52 2.3.3 Comparison of the nanopore assembly of CEN.PK113-7D to S288C. 54 2.3.4 Long-read sequencing data reveals chromosome structure hetero- geneity in CEN.PK113-7D Delft . 58 2.4 Discussion . 59 3 Alpaca: a kmer-based approach for investigating mosaic structures in microbial genomes 63 3.1 Introduction . 63 3.2 Method overview . 64 3.2.1 Alpaca foundations. 64 3.2.2 Alpaca implementation. 66 3.3 Runtime and conclusion . 69 4 Chromosome level assembly and comparative genome analysis confirm lager-brewing yeasts originated from a single hybridization 71 4.1 Introduction . 72 4.2 Methods . 74 4.2.1 Yeast strains, cultivation techniques and genomic DNA extraction . 74 4.2.2 Short-read Illumina sequencing . 75 4.2.3 Oxford nanopore minION sequencing and basecalling . 75 4.2.4 De novo genome assembly . 75 4.2.5 Comparison between ONT-only and Illumina-only genome assem- bly . 75 4.2.6 FLO gene analysis . 76 4.2.7 Intra-chromosomal heterozygosity . 76 4.2.8 Similarity analysis and lineage tracing of S. pastorianus sub-genomes using Alpaca . 76 4.3 Results . 77 4.3.1 Near-complete haploid assembly of CBS 1483 . 77 4.3.2 Comparison between Oxford nanopore minION and Illumina as- semblies . 78 4.3.3 Sequence heterogeneity in CBS 1483 . 82 4.3.4 Structural heterogeneity in CBS 1483 chromosomes . 83 4.3.5 Differences between Group 1 and 2 genomes do not result from separate ancestry. 84 4.4 Discussion . 88 4.5 Conclusion . 91 5 A streaming algorithm to infer species composition in Saccharomyces hybrid genomes 93 5.1 Introduction . 93 Contents ix 5.2 Methods . 95 5.2.1 The set-containment problem in the context of possible hybridiza- tion events from a phylogenetic tree . 95 5.2.2 Approximate fractional genome contribution calculations with Red- wood2 . 97 5.2.3 Benchmarking Redwood2 . 100 5.3 Results and discussion . 102 5.3.1 Saccharomyces sensu strictu tree construction. 103 5.3.2 Redwood2’s estimated species contributions are accurate in a sim- ulated benchmark . 104 5.3.3 Redwood2 provides informative global species estimations in pub- lic hybrid genomes . 107 5.3.4 Redwood2 limitations . 110 5.4 Conclusion . 111 6 Approximate, simultaneous comparison of microbial genome architec- tures via syntenic anchoring of quiver representations 113 6.1 Introduction . 114 6.2 Methods . 116 6.2.1 Synteny and the quiver representation of genomes . 116 6.2.2 Construction morphisms via syntenic anchors . 118 6.2.3 Canonical quiver construction . 120 6.2.4 Structural variant calling using quiver representations . 121 6.2.5 Ptolemy implementation . 122 6.2.6 Benchmark data . 122 6.3 Results . 122 6.3.1 Conserved genome architectures in MTBC . 123 6.3.2 Variable genome architectures in Yeast . 124 6.3.3 A genomic “melting-pot” in the Eco+Shig dataset. 126 6.3.4 Performance of Ptolemy . 126 6.4 Discussion . 127 6.5 Conclusion . 129 7 An educational guide for nanopore sequencing in the classroom 131 7.1 Introduction . 131 7.2 Bridging bioinformatics to biologists . 132 7.3 Integrating nanopore sequencing in the classroom . 133 7.4 Conclusion . 136 8 Discussion 139 8.1 Systematic variant calling from multi-whole genome alignments? . 140 8.2 The phasing of metagenomes. 141 Bibliography 143 Acknowledgments 181 Curriculum Vitæ 183 x Contents List of Publications 185 xi Summary We, humans, have an ancient microscopic companion: yeasts. These microbial organisms have helped shape our evolution, our civilizations, and our sciences. The evolutionary event that enabled yeasts to produce alcohol more than 100 million years ago was fol- lowed with adaptations throughout the animal kingdom to tolerate it. Our realisation that yeast could be used to produce bread, beer, and wine quickly enabled us to fuel the high, caloric need of many civilizations. An international dispute nearly two centuries ago about the biological nature of yeast in alcohol production, ultimately led to the founding of microbiology and the various medicinal benefits from its practice. And today, yeasts are the ‘Swiss Army knives of biotechnology’, as they are often engineered to produced cheaper therapeutics and alternative energy sources. Although an ancient companion, we have only begun to truly understand yeasts and their biotechnological capabilities, largely due to a new scientific instrument: genome sequencing technology.